37258 Commits

Author SHA1 Message Date
gf2121 67be0189bc
clean up sleep (#12914) 2024-01-08 13:48:26 +01:00
Adrien Grand 40060f8b70
Reduce contention on flushControl.isFullFlush(). (#12958)
`flushControl.isFullFlush()` is a surprising source of contention with
documents that are cheap to index and many indexing threads. If I slightly
modify luceneutil's `IndexGeoNames` benchmark to configure a 4GB indexing
buffer and disable `TextField` fields, which are more costly to index than
`KeywordField` or `IntField` fields, this brings the time to load the entire
dataset into the `IndexWriter` buffers from 8.0s to 7.0s.
2024-01-08 13:23:05 +01:00
Stefan Vodita 115a30d462
Increase stale PRs action budget and mark not debug-only (#12998) 2024-01-08 07:20:59 -05:00
Stefan Vodita 564b2ebecc
Introduce workflow for stale PRs (#12813)
* Introduce stale workflow

* Exempt draft PRs

* Tune the action to our needs

1. Don't mark issues stale, only PRs.
2. Don't close anything automatically.
3. Keep the default Stale label.
4. Run in debug-only mode to start.
2024-01-08 06:22:19 -05:00
Dzung Bui 4c883a414c
Optimize FST on-heap BytesReader (#12879)
* Move size() to FSTStore

* Remove size() completely

* Allow FST builder to use different DataOutput

* access BytesStore byte[] directly for copying

* Rename BytesStore

* Change class to final

* Reorder methods

* Remove unused methods

* Rename truncate to setPosition() and remove skipBytes()

* Simplify the writing operations

* Update comment

* remove unused parameter

* Simplify BytesStore operation

* tidy code

* Rename copyBytes to writeTo

* Simplify BytesStore operations

* Embed writeBytes() to FSTCompiler

* Fix the write bytes method

* Remove the default block bits constant

* add assertion

* Rename method parameter names

* Move reverse to FSTCompiler

* Revert setPosition call

* Address comments

* Return immediately when writing 0 bytes

* Add comment &

* Rename variables

* Fix the compile error

* Remove isReadable()

* Remove isReadable()

* Optimize ReadWriteDataOutput

* tidy code

* Freeze the DataOutput once finished()

* Refactor

* freeze the DataOutput before use

* Improvement of ReadWriteDataOutput

* tidy code

* Address comments and add off-heap FST tests

* Remove the hardcoded random

* Ignore the Test2BFSTOffHeap test

* Simplify ReadWriteDataOutput

* Do not expose blockBits

* tidy code

* Remove 0 initialization

* Add assertion and comment
2024-01-06 07:47:19 -05:00
sabi0 7b8aece125
Use Collections.addAll() instead of manual array copy and misc. code cleanups (#12977) 2024-01-04 22:27:36 +01:00
sabi0 1a939410dd
Misc code cleanups (#12974) 2024-01-04 08:37:49 +01:00
Kaival Parikh 248f067d52
Reduce number of dimensions for Test[Byte|Float]VectorSimilarityQuery (#12988)
### Description

Identified in #12955, where `TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity` fails because of a disconnected HNSW graph

This is a bigger issue, but we can reduce intermittent failures by keeping the number of docs and dimensions the same as [`BaseKnnVectorQueryTestCase.testRandom`](dc9f154aa5/lucene/core/src/test/org/apache/lucene/search/BaseKnnVectorQueryTestCase.java (L470)) (similar test for KNN with random vectors)

### Command to reproduce

```
./gradlew :lucene:core:test --tests "org.apache.lucene.search.TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity" -Ptests.jvms=12 -Ptests.jvmargs= -Ptests.seed=1A1CDC0974AF361
```
2024-01-02 13:06:12 -05:00
sabi0 78b4f75a2c
Replace .collect(toList()) with .toList() and misc. code cleanups (#12978) 2023-12-30 17:04:11 +01:00
sabi0 ec9e593dc4
Remove obsolete 'mappingRules' in Tokenizer tests (#12972) 2023-12-30 16:59:59 +01:00
sabi0 67d866c586
Minor code cleanups (IntelliJ inspections). 2023-12-30 16:55:49 +01:00
Uwe Schindler 346f4ff7d2 Move changes entry to 9.10 (#12841) 2023-12-29 13:08:49 +01:00
sabi0 64cf54a4bf
Replace "UTF-8" with StandardCharsets.UTF_8 and other typo and minor cleanups (#12979) 2023-12-28 19:42:22 +01:00
sabi0 91272f45da
Replace println(String.format(...)) with printf(...) (#12976) 2023-12-28 19:32:06 +01:00
sabi0 57b104e806
Get rid of inefficient Stream.count() (#12975) 2023-12-28 19:30:01 +01:00
sabi0 9c9949b2bc
Remove unused imports (#12970) 2023-12-28 19:28:24 +01:00
Patrick Zhai 948970be58
Fix bug where NFARunAutomaton#getTransition does not set Transition correctly (#12909) 2023-12-27 22:49:35 -08:00
sabi0 02722eeb69
Add missing spaces in concatenated strings (#12967) 2023-12-23 20:30:30 -05:00
Zhang Chao dc9f154aa5
Move group-varint encoding/decoding logic to DataOutput/DataInput (#12841) 2023-12-23 13:18:34 +01:00
sabi0 9359a9dcff
Update contributing guide: autocrlf and build dependencies (#12963) 2023-12-22 09:28:53 +01:00
sabi0 f6b2006195
Fix typo in help/formatting.txt (#12960) 2023-12-21 19:58:53 +01:00
Adrien Grand 91002d04d3 Fix CheckIndex to correctly flag the automaton as binary. 2023-12-20 14:39:32 +01:00
Zhang Chao 5152051f68
Improve Javadoc for DocValuesConsumer (#12952) 2023-12-20 13:40:44 +01:00
Adrien Grand bcc7e120ba
Modernize LineFileDocs. (#12929)
This replaces `StringField`/`SortedDocValuesField` with `KeywordField` and
`IntPoint`/`NumericDocValuesField` with `IntField`.
2023-12-19 11:25:26 +01:00
Adrien Grand 5c084fcd6e
Add a stored fields test that indexes LineFileDocs. (#12927)
Real-world data exhibits patterns that the compression logic takes advantage
of but that are hard to reproduce in a randomized way, so this new test adds
interesting coverage.

It takes one second to run on my machine, so I did not mark it `@Nightly`.
2023-12-19 11:20:14 +01:00
Adrien Grand bf45ab79ec
Beef up `Terms#intersect` checks in `CheckIndex`. (#12926)
Now also testing what happens with a non-null `startTerm`. This found bugs in
`DirectPostingsFormat`.
2023-12-19 11:17:38 +01:00
Lukáš Vlček 5d6086e199
Fix position increment in (Reverse)PathHierarchyTokenizer (#12875)
* Fix PathHierarchyTokenizer positions

PathHierarchyTokenizer was emitting multiple tokens in the same position
with changing offsets. To be consistent with EdgeNGramTokenizer (which
is conceptually similar -- it's emitting multiple prefixes/suffixes off
the input string), we can output every token with length 1 with
positions incrementing by 1.

* Fix ReversePathHierarchyTokenizer positions

Making ReversePathHierarchyTokenizer consistent with recent changes in PathHierarchyTokenizer.

---------

Co-authored-by: Michael Froh <froh@amazon.com>
2023-12-18 08:48:22 -05:00
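The position change described in the commit above can be illustrated with a small sketch. This is not the tokenizer's code: the class `PathPrefixPositionsSketch`, its `Token` type, and the `tokenize` method are hypothetical names, and edge cases such as trailing delimiters are handled in a simplified way. It only shows the idea of emitting each path prefix as its own token, with the position advancing by 1 per token instead of stacking every prefix at the same position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene's PathHierarchyTokenizer.
final class PathPrefixPositionsSketch {

  // Minimal token holder: the emitted term and its absolute position.
  static final class Token {
    final String term;
    final int position;

    Token(String term, int position) {
      this.term = term;
      this.position = position;
    }

    @Override
    public String toString() {
      return term + "@" + position;
    }
  }

  // Emits every prefix of the path that ends just before a delimiter, plus the full path.
  static List<Token> tokenize(String path, char delimiter) {
    List<Token> tokens = new ArrayList<>();
    int position = -1; // the first increment moves the position to 0
    // Start at 1 so that a leading delimiter does not produce an empty token.
    for (int i = 1; i <= path.length(); i++) {
      boolean atEnd = i == path.length();
      if (atEnd || path.charAt(i) == delimiter) {
        position += 1; // each token advances the position by exactly 1
        tokens.add(new Token(path.substring(0, i), position));
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(tokenize("/a/b/c", '/'));
  }
}
```

For the input `/a/b/c` this sketch yields `/a` at position 0, `/a/b` at position 1, and `/a/b/c` at position 2, which mirrors how an edge n-gram style tokenizer numbers successive prefixes.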
Dawid Weiss 6bb244a932
An improved check for ignoring the c2-crash test if running on a client compiler. (#12953) 2023-12-18 12:37:57 +01:00
ChrisHegarty f6582ce048 Add back-compat indices for 9.9.1 2023-12-17 09:39:46 +00:00
ChrisHegarty 08728bf202 Add bugfix version 9.9.1 2023-12-17 09:20:34 +00:00
ChrisHegarty 1f1d0735c8 DOAP changes for release 9.9.1 2023-12-16 22:55:20 +00:00
Michael Sokolov 49d521145d
Use hppc IntIntHashMap to avoid Integer box/unbox when remapping vector ordinals during merge (#12950) 2023-12-15 13:24:05 -05:00
Benjamin Trent 423f8279f0
Fix flaky tests that are caused by small float vectors (#12943)
While quantization generally works well, we can end up with unexpected ordering of the resulting vectors when the number of dimensions is tiny (just two, as in our tests), we are indexing a circle, and we have random merge policies.

closes: https://github.com/apache/lucene/issues/12940
2023-12-14 14:38:22 -05:00
Michael McCandless d1551da027 #12932: get monsters tests compiling/running again (#12942) 2023-12-14 10:14:45 -05:00
Stefan Vodita b0ebb849f5
Introduce growInRange to reduce array overallocation (#12844)
In cases where we know there is an upper limit to the potential size
of an array, we can use `growInRange` to avoid allocating beyond that
limit.
2023-12-14 23:00:26 +09:00
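A minimal sketch of the idea in the commit above, assuming a simple over-allocation heuristic. The class `GrowInRangeSketch` and the `oversize` helper are hypothetical and are not Lucene's implementation; the point is only that the usual "grow with some slack" size is clamped to a known upper limit.

```java
import java.util.Arrays;

// Hypothetical sketch of growing an array with a known upper bound on its size.
final class GrowInRangeSketch {

  // Assumed heuristic: request about 1/8 more than needed, and at least 3 extra slots.
  static int oversize(int minLength) {
    long target = (long) minLength + Math.max(3, minLength >> 3);
    return (int) Math.min(target, Integer.MAX_VALUE);
  }

  // Returns an array of at least minLength elements, never longer than maxLength.
  static int[] growInRange(int[] array, int minLength, int maxLength) {
    if (minLength > maxLength) {
      throw new IllegalArgumentException(
          "minLength " + minLength + " exceeds maxLength " + maxLength);
    }
    if (array.length >= minLength) {
      return array; // already large enough, no copy needed
    }
    // Over-allocate as usual, but clamp to the known upper limit.
    int newLength = Math.min(maxLength, oversize(minLength));
    return Arrays.copyOf(array, newLength);
  }

  public static void main(String[] args) {
    int[] values = {1, 2, 3, 4};
    System.out.println(growInRange(values, 5, 6).length);
    System.out.println(growInRange(values, 5, 100).length);
  }
}
```

Under this heuristic, growing a 4-element array to hold 5 elements would normally allocate 8 slots; with an upper limit of 6 the sketch allocates only 6.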
Michael McCandless ebf9e29570
Ensure Nori/Kuromoji shipped binary FST is the latest version (#12933)
* ensure Nori/Kuromoji shipped binary FST is the latest version (closes #12911)

* fold feedback from @uschindler: sharpen test failure methods to give the specific gradlew command to regenerate the precise FST (not everything)

* add javadoc for FSTMetadata.getVersion
2023-12-14 07:38:34 -05:00
Jakub Slowinski 3965319441
Attempting to clean up some remaining Solr references (#12939)
* Attempting to clean up some remaining Solr references

* Update gradle/help.gradle

Co-authored-by: Dawid Weiss <dawid.weiss@gmail.com>

---------

Co-authored-by: Dawid Weiss <dawid.weiss@gmail.com>
2023-12-14 06:02:16 -05:00
Patrick Zhai da69346257 Add CHANGES.txt entry for #12910 2023-12-14 09:14:18 +09:00
Patrick Zhai f303d29baf
Refactor around NeighborArray (#12910) 2023-12-14 09:03:44 +09:00
Uwe Schindler 16d0b822b3
Prevent the common zero-width code points and detect invalid UTF-8 encoding in our sources and selected resource files (#12937)
* Simple patch to prevent the common zero-width code points in our source and some types of resource files

* Validate correct UTF-8 input and fix buggy CSS file (ISO-8859-x encoded)

* add a bit of context

* Add CHANGES.txt
2023-12-13 17:27:05 +01:00
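The two checks named in the commit above can be sketched with the JDK alone. This is not the build's actual validation code: the class `SourceEncodingCheckSketch` and its methods are hypothetical, and the list of zero-width code points is an assumption, since the commit message does not enumerate them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: strict UTF-8 validation plus a scan for zero-width code points.
final class SourceEncodingCheckSketch {

  // Assumed list: zero-width space, non-joiner, joiner, word joiner, and BOM / no-break space.
  private static final int[] ZERO_WIDTH = {0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF};

  // Decodes strictly: malformed input raises an exception instead of being replaced.
  static String decodeStrictUtf8(byte[] bytes) throws CharacterCodingException {
    return StandardCharsets.UTF_8
        .newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT)
        .decode(ByteBuffer.wrap(bytes))
        .toString();
  }

  // Returns the char index of the first zero-width code point, or -1 if there is none.
  static int firstZeroWidthIndex(String text) {
    for (int i = 0; i < text.length(); ) {
      int codePoint = text.codePointAt(i);
      for (int zeroWidth : ZERO_WIDTH) {
        if (codePoint == zeroWidth) {
          return i;
        }
      }
      i += Character.charCount(codePoint);
    }
    return -1;
  }

  // True only if the bytes are valid UTF-8 and contain none of the listed code points.
  static boolean isClean(byte[] bytes) {
    try {
      return firstZeroWidthIndex(decodeStrictUtf8(bytes)) < 0;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(isClean("plain text".getBytes(StandardCharsets.UTF_8)));
    System.out.println(isClean("a\u200Bb".getBytes(StandardCharsets.UTF_8)));
  }
}
```

A file saved in an ISO-8859-x encoding typically fails the strict decode as soon as it contains a non-ASCII byte, which is the kind of problem the commit reports finding in a CSS file.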
Kaival Parikh 6c5dcc1795
Fix failing BaseVectorSimilarityQueryTestCase#testApproximate (#12922)
Discovered in #12921, and introduced in #12679 

The first issue is that we weren't advancing the `VectorScorer` [here](cf13a92950/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java (L257-L262)) -- so it was still un-positioned while trying to compute the similarity score

Earlier in the PR, the underlying delegate of the `FilteredDocIdSetIterator` was `scorer.iterator()` (see [here](cad565439b/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java (L107))) -- so we didn't need to explicitly advance it

Later, we decided to maintain parity with `AbstractKnnVectorQuery` and introduce filtering in `AbstractVectorSimilarityQuery` (see [this commit](5096790f28)) to determine the `visitLimit` of approximate search -- after which the underlying iterator changed to the accepted docs (see [here](5096790f28/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java (L255))) and I missed advancing the `VectorScorer` explicitly.

After doing so, we no longer get the original `java.lang.ArrayIndexOutOfBoundsException` -- but the `BaseVectorSimilarityQueryTestCase#testApproximate` starts failing because it falls back to exact search, as the limit of the prefilter is met during graph search

Relaxed the parameters of the test to fix this (making the filter less restrictive, and visiting fewer nodes so that approximate search completes without hitting its limit)

Sorry for missing this earlier!
2023-12-13 10:11:45 -05:00
Robert Muir 98d2df17d5
enable error-prone's DisableUnicodeInCode check (#12936)
Closes #12931
2023-12-13 08:19:22 -05:00
ChrisHegarty 6b24910e4a Add changelog entries for 9.9.1 2023-12-13 11:44:01 +00:00
ChrisHegarty 487830ed05 Add back-compat indices for 9.9.0 2023-12-13 11:41:23 +00:00
ChrisHegarty 8324a890fe Fix doap 9.9.0 revision 2023-12-13 09:18:51 +00:00
ChrisHegarty f5059231a8 DOAP changes for release 9.9.0 2023-12-13 08:52:56 +00:00
Mike McCandless ee3d60ff92 fix silly typo 2023-12-12 13:04:45 -05:00
Mike McCandless 1ac1b1cadc #12924: regenerate binary FSTs for Nori and Kuromoji dictionaries to match current FST format 2023-12-12 12:29:01 -05:00
Michael Sokolov e18f9b1eb0
Add forceMerge to test to fix intermittent failure; addresses #12896 (#12928) 2023-12-12 09:11:24 -05:00
Uwe Schindler 10387f136f Fix encoding problem caused by invisible character with ExtractJdkApis.java 2023-12-12 15:00:01 +01:00