Commit Graph

37018 Commits

Author SHA1 Message Date
Adrien Grand bf45ab79ec
Beef up `Terms#intersect` checks in `CheckIndex`. (#12926)
Now also testing what happens with a non-null `startTerm`. This found bugs in
`DirectPostingsFormat`.
2023-12-19 11:17:38 +01:00
Lukáš Vlček 5d6086e199
Fix position increment in (Reverse)PathHierarchyTokenizer (#12875)
* Fix PathHierarchyTokenizer positions

PathHierarchyTokenizer was emitting multiple tokens in the same position
with changing offsets. To be consistent with EdgeNGramTokenizer (which
is conceptually similar -- it's emitting multiple prefixes/suffixes off
the input string), we can output every token with length 1 with
positions incrementing by 1.

* Fix ReversePathHierarchyTokenizer positions

Making ReversePathHierarchyTokenizer consistent with recent changes in PathHierarchyTokenizer.

---------

Co-authored-by: Michael Froh <froh@amazon.com>
2023-12-18 08:48:22 -05:00
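
To illustrate the position change described in #12875 above, here is a minimal standalone sketch. It does not use Lucene's tokenizer classes: `PathPrefixSketch` and its `Token` holder are hypothetical names, and the code only mimics how a forward path tokenizer emits one prefix per path component, each with a position increment of 1 instead of stacking every prefix on the same position.

```java
import java.util.ArrayList;
import java.util.List;

final class PathPrefixSketch {
  /** Hypothetical token holder: text, character offsets, and position increment. */
  static final class Token {
    final String text;
    final int startOffset;
    final int endOffset;
    final int positionIncrement;

    Token(String text, int startOffset, int endOffset, int positionIncrement) {
      this.text = text;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
      this.positionIncrement = positionIncrement;
    }
  }

  /** Emits every prefix of {@code path} that ends at a delimiter or at the end of input. */
  static List<Token> tokenize(String path, char delimiter) {
    List<Token> tokens = new ArrayList<>();
    int from = 1; // start at 1 so a leading delimiter does not produce an empty token
    while (from <= path.length()) {
      int next = path.indexOf(delimiter, from);
      int end = (next == -1) ? path.length() : next;
      // Every token starts at offset 0 and advances the position by exactly 1.
      tokens.add(new Token(path.substring(0, end), 0, end, 1));
      if (next == -1) {
        break;
      }
      from = next + 1;
    }
    return tokens;
  }

  public static void main(String[] args) {
    int position = -1;
    for (Token t : tokenize("/a/b/c", '/')) {
      position += t.positionIncrement;
      System.out.println(position + " " + t.text + " [" + t.startOffset + "," + t.endOffset + ")");
    }
  }
}
```

With the input `/a/b/c`, the sketch yields the prefixes `/a`, `/a/b`, and `/a/b/c` at consecutive positions, which matches the behavior the commit message compares to EdgeNGramTokenizer.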
Dawid Weiss 6bb244a932
An improved check for ignoring the c2-crash test if running on a client compiler. (#12953) 2023-12-18 12:37:57 +01:00
ChrisHegarty f6582ce048 Add back-compat indices for 9.9.1 2023-12-17 09:39:46 +00:00
ChrisHegarty 08728bf202 Add bugfix version 9.9.1 2023-12-17 09:20:34 +00:00
ChrisHegarty 1f1d0735c8 DOAP changes for release 9.9.1 2023-12-16 22:55:20 +00:00
Michael Sokolov 49d521145d
Use hppc IntIntHashMap to avoid Integer box/unbox when remapping vector ordinals during merge (#12950) 2023-12-15 13:24:05 -05:00
Benjamin Trent 423f8279f0
Fix flaky tests that are caused by small float vectors (#12943)
While quantization generally works well, when the number of dimensions is tiny (just two like in our tests), and we are indexing a circle, and we have random merge policies, we can end up getting unexpected ordering on the resulting vectors.

closes: https://github.com/apache/lucene/issues/12940
2023-12-14 14:38:22 -05:00
Michael McCandless d1551da027 #12932: get monsters tests compiling/running again (#12942) 2023-12-14 10:14:45 -05:00
Stefan Vodita b0ebb849f5
Introduce growInRange to reduce array overallocation (#12844)
In cases where we know there is an upper limit to the potential size
of an array, we can use `growInRange` to avoid allocating beyond that
limit.
2023-12-14 23:00:26 +09:00
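
The idea behind `growInRange` in #12844 above can be sketched as follows. This is a standalone illustration, not Lucene's `ArrayUtil` code: the class name `GrowInRangeSketch` is hypothetical, and the 1.5x over-allocation policy is an assumption chosen only to show how the upper bound caps the new length.

```java
import java.util.Arrays;

final class GrowInRangeSketch {
  /**
   * Returns an array with at least {@code minLength} slots, over-allocating
   * for future growth but never beyond {@code maxLength}.
   */
  static int[] growInRange(int[] array, int minLength, int maxLength) {
    if (minLength < 0 || minLength > maxLength) {
      throw new IllegalArgumentException(
          "requested minimum " + minLength + " is outside [0, " + maxLength + "]");
    }
    if (array.length >= minLength) {
      return array; // already large enough, no copy needed
    }
    // Assumed growth policy: 1.5x the requested minimum (computed as long to avoid overflow).
    long oversized = (long) minLength + (minLength >> 1);
    // Clamp to the known upper limit so we never allocate past it.
    int newLength = (int) Math.min((long) maxLength, oversized);
    return Arrays.copyOf(array, newLength);
  }

  public static void main(String[] args) {
    int[] ordinals = new int[4];
    ordinals = growInRange(ordinals, 10, 12);
    System.out.println(ordinals.length);
  }
}
```

In this sketch an unbounded grow to 10 slots would allocate 15, while the bounded call stops at the known limit of 12.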
Michael McCandless ebf9e29570
Ensure Nori/Kuromoji shipped binary FST is the latest version (#12933)
* ensure Nori/Kuromoji shipped binary FST is the latest version (closes #12911)

* fold feedback from @uschindler: sharpen test failure methods to give the specific gradlew command to regenerate the precise FST (not everything)

* add javadoc for FSTMetadata.getVersion
2023-12-14 07:38:34 -05:00
Jakub Slowinski 3965319441
Attempting to clean up some remaining Solr references (#12939)
* Attempting to clean up some remaining Solr references

* Update gradle/help.gradle

Co-authored-by: Dawid Weiss <dawid.weiss@gmail.com>

---------

Co-authored-by: Dawid Weiss <dawid.weiss@gmail.com>
2023-12-14 06:02:16 -05:00
Patrick Zhai da69346257 Add CHANGES.txt entry for #12910 2023-12-14 09:14:18 +09:00
Patrick Zhai f303d29baf
Refactor around NeighborArray (#12910) 2023-12-14 09:03:44 +09:00
Uwe Schindler 16d0b822b3
Prevent the common zero-width code points and detect invalid UTF-8 encoding in our sources and selected resource files (#12937)
* Simple patch to prevent the common zero-width code points in our source and some types of resource files

* Validate correct UTF-8 input and fix buggy CSS file (ISO-8859-x encoded)

* add a bit of context

* Add CHANGES.txt
2023-12-13 17:27:05 +01:00
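
The two checks named in #12937 above (strict UTF-8 validation and rejection of zero-width code points) can be sketched with the JDK alone. This is not the build check that the commit added: the class name `SourceEncodingCheckSketch` is hypothetical, and the list of zero-width code points is an assumption, since the commit message does not enumerate them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class SourceEncodingCheckSketch {
  // Assumed set of "common zero-width code points"; the commit does not list them.
  private static final int[] ZERO_WIDTH = {0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF};

  /** Returns a description of the first problem found, or empty if the content is clean. */
  static Optional<String> check(byte[] content) {
    String text;
    try {
      // REPORT makes the decoder throw instead of silently substituting U+FFFD.
      text = StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(content))
          .toString();
    } catch (CharacterCodingException e) {
      return Optional.of("invalid UTF-8");
    }
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      for (int zw : ZERO_WIDTH) {
        if (cp == zw) {
          return Optional.of(String.format("zero-width code point U+%04X at char index %d", cp, i));
        }
      }
      i += Character.charCount(cp);
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    byte[] latin1 = {'a', (byte) 0xE9, 'b'}; // ISO-8859-1 bytes, not valid UTF-8
    System.out.println(check(latin1).orElse("ok"));
  }
}
```

A file saved in an ISO-8859-x encoding, like the CSS file mentioned in the commit, typically fails the first check because its non-ASCII bytes do not form valid UTF-8 sequences.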
Kaival Parikh 6c5dcc1795
Fix failing BaseVectorSimilarityQueryTestCase#testApproximate (#12922)
Discovered in #12921, and introduced in #12679 

The first issue is that we weren't advancing the `VectorScorer` [here](cf13a92950/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java#L257-L262) -- so it was still unpositioned while trying to compute the similarity score

Earlier in the PR, the underlying delegate of the `FilteredDocIdSetIterator` was `scorer.iterator()` (see [here](cad565439b/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java#L107)) -- so we didn't need to explicitly advance it

Later, we decided to maintain parity with `AbstractKnnVectorQuery` and introduce filtering in `AbstractVectorSimilarityQuery` (see [this commit](5096790f28)) to determine the `visitLimit` of approximate search -- after which the underlying iterator changed to the accepted docs (see [here](5096790f28/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java#L255)) and I missed advancing the `VectorScorer` explicitly.

After doing so, we no longer get the original `java.lang.ArrayIndexOutOfBoundsException` -- but the `BaseVectorSimilarityQueryTestCase#testApproximate` starts failing because it falls back to exact search, as the limit of the prefilter is met during graph search

Relaxed the parameters of the test to fix this (making the filter less restrictive, and visiting fewer nodes so that approximate search completes without hitting its limit)

Sorry for missing this earlier!
2023-12-13 10:11:45 -05:00
Robert Muir 98d2df17d5
enable error-prone's DisableUnicodeInCode check (#12936)
Closes #12931
2023-12-13 08:19:22 -05:00
ChrisHegarty 6b24910e4a Add changelog entries for 9.9.1 2023-12-13 11:44:01 +00:00
ChrisHegarty 487830ed05 Add back-compat indices for 9.9.0 2023-12-13 11:41:23 +00:00
ChrisHegarty 8324a890fe Fix doap 9.9.0 revision 2023-12-13 09:18:51 +00:00
ChrisHegarty f5059231a8 DOAP changes for release 9.9.0 2023-12-13 08:52:56 +00:00
Mike McCandless ee3d60ff92 fix silly typo 2023-12-12 13:04:45 -05:00
Mike McCandless 1ac1b1cadc #12924: regenerate binary FSTs for Nori and Kuromoji dictionaries to match current FST format 2023-12-12 12:29:01 -05:00
Michael Sokolov e18f9b1eb0
Add forceMerge to test to fix intermittent failure; addresses #12896 (#12928) 2023-12-12 09:11:24 -05:00
Uwe Schindler 10387f136f Fix encoding problem caused by invisible character with ExtractJdkApis.java 2023-12-12 15:00:01 +01:00
Adrien Grand e0f4321b40
Check `Terms#intersect` in CheckIndex. (#12925)
This commit adds coverage of `Terms#intersect` to `CheckIndex` and indexes
`LineFileDocs` in `BasePostingsFormatTestCase` to get some coverage with
real-world data.

With this change, `TestLucene90PostingsFormat` now exhibits #12895.
2023-12-12 14:08:44 +01:00
gf2121 05b14e23b1
Push and pop OutputAccumulator as IntersectTermsEnumFrames are pushed and popped (#12900) 2023-12-12 20:26:02 +08:00
Greg Miller cf13a92950
Make TestDrillSideways#testCollectionTerminated less strict (#12920) 2023-12-11 19:20:08 -08:00
Chris Hegarty a6f70ad2bb
Reflow computeCommonPrefixLengthAndBuildHistogram to avoid crash (#12905)
This commit reflows the code in the method body of computeCommonPrefixLengthAndBuildHistogram to avoid a JVM JIT crash. The change works around the JVM bug; the workaround is somewhat fragile, but it is the best we can do for now and appears to be working well.
2023-12-11 20:10:03 +00:00
Kaival Parikh cd195980ec
Add support for similarity-based vector searches (#12679)
### Description

Background in #12579

Add support for getting "all vectors within a radius" as opposed to getting the "topK closest vectors" in the current system

### Considerations

I've tried to keep this change minimal and non-invasive by not modifying any APIs and re-using existing HNSW graphs -- changing the graph traversal and result collection criteria to:
1. Visit all nodes (reachable from the entry node in the last level) that are within an outer "traversal" radius
2. Collect all nodes that are within an inner "result" radius

### Advantages

1. Queries that have a high number of "relevant" results will get all of those (not limited by `topK`)
2. Conversely, arbitrary queries where many results are not "relevant" will not waste time in getting all `topK` (when some of them will be removed later)
3. Results of HNSW searches need not be sorted - and we can store them in a plain list as opposed to min-max heaps (saving on `heapify` calls). Merging results from segments is also cheaper, where we just concatenate results as opposed to calculating the index-level `topK`

On a higher level, finding `topK` results needed HNSW searches to happen in `#rewrite` because of an interdependence of results between segments - where we want to find the index-level `topK` from multiple segment-level results. This is kind of against Lucene's concept of segments being independently searchable sub-indexes?

Moreover, we needed explicit concurrency (#12160) to perform these in parallel, and these shortcomings would be naturally overcome with the new objective of finding "all vectors within a radius" - inherently independent of results from another segment (so we can move searches to a more fitting place?)

### Caveats

I could not find much precedent in using HNSW graphs this way (or even the radius-based search for that matter - please add links to existing work if someone is aware) and consequently marked all classes as `@lucene.experimental`

For now I have re-used lots of functionality from `AbstractKnnVectorQuery` to keep this minimal, but if the use-case is accepted more widely we can look into writing more suitable queries (as mentioned above briefly)
2023-12-11 14:18:36 -05:00
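
The two-radius traversal described under "Considerations" in #12679 above can be sketched on a toy graph. This is not Lucene's HNSW code: `RadiusSearchSketch`, the adjacency-array graph, and the precomputed `similarity` array are hypothetical simplifications, and the sketch assumes that a higher similarity means a closer vector, so the inner "result" radius corresponds to the higher threshold.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class RadiusSearchSketch {
  /**
   * Explores the graph from {@code entryNode}, expanding only nodes whose similarity
   * to the query is at least {@code traversalSimilarity}, and collects every explored
   * node whose similarity is at least {@code resultSimilarity}.
   */
  static List<Integer> search(
      int[][] neighbors,
      float[] similarity, // similarity[i] = precomputed similarity of node i to the query
      int entryNode,
      float traversalSimilarity,
      float resultSimilarity) {
    List<Integer> results = new ArrayList<>(); // plain list: no heap, no sorting
    boolean[] visited = new boolean[neighbors.length];
    Deque<Integer> queue = new ArrayDeque<>();
    visited[entryNode] = true;
    queue.add(entryNode);
    while (!queue.isEmpty()) {
      int node = queue.poll();
      if (similarity[node] >= resultSimilarity) {
        results.add(node); // inside the inner "result" radius
      }
      for (int nb : neighbors[node]) {
        if (!visited[nb]) {
          visited[nb] = true;
          if (similarity[nb] >= traversalSimilarity) {
            queue.add(nb); // inside the outer "traversal" radius, keep exploring
          }
        }
      }
    }
    return results;
  }

  public static void main(String[] args) {
    int[][] neighbors = {{1, 4}, {0, 2}, {1, 3}, {2}, {0}};
    float[] similarity = {0.9f, 0.7f, 0.95f, 0.2f, 0.3f};
    System.out.println(search(neighbors, similarity, 0, 0.6f, 0.8f));
  }
}
```

Node 1 in this toy graph falls between the two thresholds: it is not returned, but it is still expanded, which is how node 2 gets reached. Because nothing depends on a global `topK`, per-segment result lists produced this way could simply be concatenated, as the commit message notes.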
Shubham Chaudhary 1630ed4bd8
Remove some redundant modifiers from code (#12880) 2023-12-11 10:17:47 -08:00
Adrien Grand c0fd4404ea
Add back tests for the 9.0->9.8 block tree terms dict format. (#12908)
I just noticed that the move from FOR to PFOR did all the work to make the old
format (FOR) writeable, but missed keeping an instance of
`BasePostingsFormatTestCase` for this format.
2023-12-11 18:14:28 +01:00
Jakub Slowinski 4d57070bf7
Removing @lucene.experimental tags in testXXX methods in CheckIndex (#12893) 2023-12-11 11:31:18 -05:00
Michael McCandless af11ea2562 #12901: add TestBackwardsCompatibility test case that reveals the block tree IntersectTermsEnum bug #12895 (#12913)
* #12901: add TestBackwardsCompatibility test case that reveals the block tree IntersectTermsEnum bug #12895

* woops, forgot to tidy up

* #12901: Ignore failing test; reflow text to work around spotless' poor text formatting skills
2023-12-11 08:53:53 -05:00
Adrien Grand 069c048770
Add back tests for Lucene90PostingsFormat (#12904)
I just noticed that the move from FOR to PFOR did all the work to make the old
format (FOR) writeable, but missed keeping an instance of
`BasePostingsFormatTestCase` for this format.
2023-12-11 13:09:40 +01:00
Usman Shaikh 6a56b2ea7d
Fix typo Levenstein -> Levenshtein (#12519)
Please do not lose sight of the exquisite irony of mis-spelling this particular word ;)
2023-12-11 05:49:36 -05:00
Greg Miller a9b5ef4749
Ensure #finish is called on all drill-sideways FacetCollectors even when no hits are scored (#12853) 2023-12-08 15:25:57 -08:00
Dzung Bui fb269c9e64
Fix NPE on off-heap test and FST is null (#12894) 2023-12-08 09:59:39 -08:00
Stefan Vodita 1c4c1c831d
Quick exit for non-zero slice buffers (#12812) 2023-12-08 09:41:58 -08:00
Greg Miller 18a6033a7e
Remove DrillSideways#createDrillDownFacetsCollector in favor of the manager-based hook (#12855) 2023-12-08 09:37:51 -08:00
Mike McCandless 9ec938f2ef Revert "SlowCompositeCodecWrapper: remove redundant FieldInfo lookup in its PointsReader.getValues method"
This reverts commit 0aff413797.
2023-12-08 08:14:16 -05:00
Dawid Weiss ba81826951
Make unified highlighter tests avoid mock random merge policy's document reordering (#12889) 2023-12-08 08:27:45 +01:00
Stefan Vodita ad9eff3c27
Shorten getOrdinal synchronized loop (#12870) 2023-12-07 16:27:49 -08:00
Chris Hegarty 5e1e6c9e68
Upgrade ECJ to 3.36.0 (#12888)
This commit upgrades ECJ to 3.36.0, as it has support for more recent Java versions, like Java 21.
2023-12-07 21:13:10 +00:00
Zach Chen fcc6f71f0e Move change entry for GITHUB#11041 from version 10.0.0 to version 9.10.0 2023-12-07 11:01:46 -08:00
Dawid Weiss 755de5aae3
Performance improvements to MatchHighlighter and MatchRegionRetriever (#12881) 2023-12-07 18:42:13 +01:00
Dzung Bui b699981847
Allow FST builder to use different writer (#12543) (#12624)
* Move size() to FSTStore

* Remove size() completely

* Allow FST builder to use different DataOutput

* access BytesStore byte[] directly for copying

* Rename BytesStore

* Change class to final

* Reorder methods

* Remove unused methods

* Rename truncate to setPosition() and remove skipBytes()

* Simplify the writing operations

* Update comment

* remove unused parameter

* Simplify BytesStore operation

* tidy code

* Rename copyBytes to writeTo

* Simplify BytesStore operations

* Embed writeBytes() to FSTCompiler

* Fix the write bytes method

* Remove the default block bits constant

* add assertion

* Rename method parameter names

* Move reverse to FSTCompiler

* Revert setPosition call

* Address comments

* Return immediately when writing 0 bytes

* Add comment &

* Rename variables

* Fix the compile error

* Remove isReadable()

* Remove isReadable()

* Optimize ReadWriteDataOutput

* tidy code

* Freeze the DataOutput once finished()

* Refactor

* freeze the DataOutput before use

* Address comments and add off-heap FST tests

* Remove the hardcoded random

* Ignore the Test2BFSTOffHeap test

* Simplify ReadWriteDataOutput

* Update Javadoc

* Address comments
2023-12-07 09:26:03 -05:00
Mike McCandless f55f2af0d0 SlowCompositeCodecWrapper: remove redundant FieldInfo lookup in its PointsReader.getValues method 2023-12-07 09:03:10 -05:00
Mike McCandless 4f1b967be9 #11023: move CHANGES entry down to 9.10 section 2023-12-07 06:45:40 -05:00
Jakub Slowinski 37cca614fc
CheckIndex - Removal of some dead code (#12876)
* CheckIndex - Remove some dead code

* Add back testPostings and testTermVectors
2023-12-07 06:34:33 -05:00