Commit Graph

35731 Commits

Author SHA1 Message Date
Alan Woodward 2183756f1c
LUCENE-10413: Make default Ukrainian stopword set available (#665)
This commit adds a new getDefaultStopwords() static method to
UkrainianMorfologikAnalyzer, which makes it possible to create an
analyzer with the default stop word set but a custom stem exclusion
set.
2022-02-09 14:37:44 +00:00
Greg Miller 8178ffda00
LUCENE-10403: Add ArrayUtil#grow(T[]) (#644) 2022-02-08 09:43:55 -08:00
Adrien Grand ce93d45532 LUCENE-10367: Optimize CoveringQuery for the case when the minimum number of matching clauses is a constant. 2022-02-08 17:25:53 +01:00
Nhat Nguyen bcb70fd742
LUCENE-10190: Ensure changes are visible before advancing seqno (#640)
DocumentsWriter#anyChanges() can return false after we process and
generate a sequence number for an update operation, but before we adjust
numDocsInRAM. In this window of time, refreshes are no-ops, although
the maxCompletedSequenceNumber has advanced.
2022-02-08 10:29:20 -05:00
gf2121 5250186bd1
LUCENE-10410: Add more tests for legacy decoding logic in DocIdsWriter (#654) 2022-02-08 16:59:32 +08:00
Tomoko Uchida 20f7f33c8d
LUCENE-10400: cleanup obsolete APIs in kuromoji (#655) 2022-02-08 09:32:33 +09:00
Julie Tibshirani eb5bdd7d15
Rename KnnGraphValues -> HnswGraph (#645)
This PR proposes some renames to clarify the code structure. The top-level
`KnnGraphValues` is renamed to `HnswGraph`, since it now represents a
hierarchical graph. It's also moved from `org.apache.lucene.index` to the
`hnsw` package.

Other renames:
* The old `HnswGraph` -> `OnHeapHnswGraph`
* `IndexedKnnGraphValues` -> `OffHeapHnswGraph` (to match
`OffHeapVectorValues`)
2022-02-07 13:21:15 -08:00
Tomoko Uchida e7546c2427
LUCENE-10400: revise binary dictionaries' constructor in kuromoji (#643) 2022-02-07 19:31:22 +09:00
gf2121 e93b08f471
LUCENE-10315: Add CHANGES for #541 (#653) 2022-02-07 16:23:34 +08:00
gf2121 8c67a3816b
LUCENE-10315: Speed up BKD leaf block ids codec by a 512 ints ForUtil (#541) 2022-02-07 15:35:54 +08:00
Ignacio Vera 4c578017af
LUCENE-10405: binary and Sorted doc values are stored as BytesRef instead of BytesRefHash in memory index (#647)
When using the MemoryIndex, binary and Sorted doc values are stored 
as BytesRef instead of BytesRefHash so they don't have a limit on size.
2022-02-07 07:33:07 +01:00
Greg Miller deef3c704e
Update github hunspell regression test to use JDK 17 (#651) 2022-02-06 08:00:31 -08:00
Gautam Worah de4eccbb55
LUCENE-10050 Remove DrillSideways#search(DrillDownQuery,Collector) in favor of DrillSideways#search(DrillDownQuery,CollectorManager) (#632) 2022-02-04 15:25:52 -08:00
Mayya Sharipova ff2189c477 Add changes item for LUCENE-10054 2022-02-04 14:51:48 -05:00
Mayya Sharipova ea4ab26e52 LUCENE-9573 Add Vectors to TestBackwardsCompatibility (#636)
Update index.9.0.0-cfs.zip and index.9.0.0-nocfs.zip
to include knn vector field.
2022-02-04 14:42:44 -05:00
Alan Woodward 6b64f4b556
LUCENE-10407: Set bpos flag to true when containing filter is exhausted (#648)
ContainedByIntervalIterator and OverlappingIntervalIterator set their 'is the filter
interval exhausted' flag to `false` once it has returned NO_MORE_POSITIONS on
a document, so that subsequent calls to `startPosition()` will also return
NO_MORE_POSITIONS. ContainingIntervalIterator omits to do this, and so it can
incorrectly report matches, for example when used in a disjunction.  This commit
fixes that omission.
2022-02-04 16:44:57 +00:00
Alan Woodward 9ebee5a058 LUCENE-10402: Changes entry 2022-02-04 15:28:44 +00:00
Alan Woodward e72d796e96
LUCENE-10402: Prefix interval automaton should be declared binary (#646) 2022-02-04 15:27:03 +00:00
Adrien Grand ed6c1b5aea
LUCENE-10401: Fix lookups on empty doc-values terms dictionaries. (#642) 2022-02-04 09:28:35 +01:00
Julie Tibshirani 57d9515eff
LUCENE-10391: Reuse data structures across HnswGraph#searchLevel calls (#641)
A couple of the data structures used in HNSW search are pretty large and
expensive to allocate. This commit creates a shared candidates queue and
visited set that are reused across calls to HnswGraph#searchLevel. Now the same
data structures are used for building the entire graph, which can cut down on
allocations during indexing. For graph building it also switches the visited
set to FixedBitSet for better performance.
2022-02-03 16:00:09 -08:00
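The reuse pattern described in the LUCENE-10391 commit above can be sketched in isolation. This is a minimal illustration, not Lucene's code: `ReusableGraphSearcher` and `countReachable` are hypothetical names, `java.util.BitSet` stands in for `FixedBitSet`, and a plain breadth-first traversal stands in for the HNSW level search. The candidates queue and visited set are allocated once in the constructor and cleared at the start of every call.

```java
import java.util.ArrayDeque;
import java.util.BitSet;

/** Hypothetical sketch: one searcher instance shares its scratch structures across calls. */
final class ReusableGraphSearcher {
  private final int[][] neighbors;               // adjacency lists, indexed by node id
  private final BitSet visited;                  // allocated once, cleared per call
  private final ArrayDeque<Integer> candidates;  // allocated once, cleared per call

  ReusableGraphSearcher(int[][] neighbors) {
    this.neighbors = neighbors;
    this.visited = new BitSet(neighbors.length);
    this.candidates = new ArrayDeque<>();
  }

  /** Counts the nodes reachable from entryPoint, including entryPoint itself. */
  int countReachable(int entryPoint) {
    // Reset the shared state instead of allocating fresh objects.
    visited.clear();
    candidates.clear();

    visited.set(entryPoint);
    candidates.add(entryPoint);
    int count = 0;
    while (!candidates.isEmpty()) {
      int node = candidates.poll();
      count++;
      for (int next : neighbors[node]) {
        if (!visited.get(next)) {
          visited.set(next);
          candidates.add(next);
        }
      }
    }
    return count;
  }
}
```

Because each call clears the shared structures before use, repeated calls on the same instance give the same result as calls on fresh instances, while the per-call allocations go away. An instance built this way is not safe to share between threads.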
Luca Cavanna bade484998
LUCENE-10385: Implement Weight#count on IndexSortSortedNumericDocValuesRangeQuery (#635)
IndexSortSortedNumericDocValuesRangeQuery can implement its count method and compute the count through a binary search, the same binary search that is used to execute the query itself, whenever all the required conditions are met.
2022-02-03 17:19:05 +01:00
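The counting idea in the LUCENE-10385 commit above can be illustrated on a plain sorted array. This is a sketch under stated assumptions, not the Lucene implementation: `SortedRangeCount` and its methods are hypothetical, and a sorted `long[]` with one value per document stands in for the index-sorted doc values. Two binary searches find the boundaries of the range, and the count is the distance between them, so no matching document is visited individually.

```java
/** Hypothetical helper: counts values in [lower, upper] in an ascending sorted array. */
final class SortedRangeCount {
  private SortedRangeCount() {}

  /** Returns the index of the first element >= target, or the array length if none. */
  static int lowerBound(long[] sortedValues, long target) {
    int lo = 0;
    int hi = sortedValues.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sortedValues[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /** Returns the index of the first element > target, or the array length if none. */
  static int upperBound(long[] sortedValues, long target) {
    int lo = 0;
    int hi = sortedValues.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sortedValues[mid] <= target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /** Counts the elements v with lower <= v <= upper using two binary searches. */
  static int count(long[] sortedValues, long lower, long upper) {
    if (lower > upper) {
      return 0;
    }
    return upperBound(sortedValues, upper) - lowerBound(sortedValues, lower);
  }
}
```

The cost is logarithmic in the number of values, regardless of how many fall inside the range.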
Luca Cavanna ee7a8d6918
LUCENE-10002: Replace some IndexSearcher#search(Collector, Query) in tests (#639)
Also use LuceneTestCase#newSearcher
2022-02-03 17:17:02 +01:00
Dawid Weiss 9a28c91a5a LUCENE-10283: bump the minimum source/release in javadoc settings. 2022-02-02 17:25:50 +01:00
Dawid Weiss 87bba4152c LUCENE-10283: bump the minimum source/release in ecj linter settings. 2022-02-02 17:25:41 +01:00
Mike Drob 56f49257ed
null check on infoStream (#637) 2022-02-02 09:44:31 -06:00
Mayya Sharipova c8e1c08cc8
Small fix for assertConsistentGraph (#631)
TestKnnGraph.testMultipleVectorFields sometimes breaks with
the following message:

java.lang.NullPointerException: Cannot invoke
 "org.apache.lucene.codecs.lucene91.Lucene91HnswVectorsReader.getGraphValues(String)"
because "vectorReader" is null

This happens in assertConsistentGraph.

This patch ensures that for a segment and a field where no vectors
are indexed, we don't run the consistent-graph check.
2022-02-01 10:21:48 -05:00
Dawid Weiss f103cca565 LUCENE-10255: Add the required unnamed modules in benchmarks subproject to module-info so that they are explicit. 2022-02-01 12:15:01 +01:00
Dawid Weiss e7212fa47d LUCENE-10283: bump minimum JDK version to 17 in buildSrc. 2022-02-01 12:09:35 +01:00
Mayya Sharipova 8dfdb261e7
LUCENE-9573 Add Vectors to TestBackwardsCompatibility (#616)
This patch adds KNN vectors for testing backward compatible indices

- Add a KnnVectorField to documents when creating a new backward
  compatible index
- Add knn vectors search and check for vector values to the testing
  of search of backward compatible indices
- Add tests for knn vector search when changing backward compatible
 indices (merging them and adding new documents to them)
2022-01-31 09:20:53 -05:00
Luca Cavanna df12e2b195
LUCENE-10395: Introduce TotalHitCountCollectorManager (#622) 2022-01-31 14:45:35 +01:00
Luca Cavanna 933c54fe87
Improve Weight#count and IndexSearcher#count javadocs (#625) 2022-01-28 16:47:25 +01:00
Robert Muir 61edacee5d
update javac flags for java 17 (#628)
Previously -Xlint:text-blocks and -Xlint:synchronization were enabled
conditionally, if the user had at least java 15 or java 16,
respectively. Enable them always.

Add new options so that the warnings list is fully configured:
* -Xlint:module (new in java 17)
* -Xlint:strictfp (new in java 17)

Disable "path" with -Xlint:-path rather than commenting it out, for
consistency.

Disable "missing-explicit-ctor" (new in java 17), as it is unlikely to
succeed right now.

Alphasort the flags and doc how to get the updated list, this makes it
easy to compare and keep up to date.
2022-01-28 05:48:58 -05:00
Adrien Grand 09ddac1fe5
Simplify HnswGraph#search. (#627)
Currently the contract on `bound` is that it holds the score of the top of the
`results` priority queue. It means that a candidate is only considered if its
score is better than the bound *or* if less than `topK` results have been
accumulated so far. I think it would be simpler if `bound` would always hold
the minimum score that is required for a candidate to be considered? This would
also be more consistent with how our WAND support works, by trusting
`setMinCompetitiveScore` alone, instead of having to check whether the priority
queue is full as well.
2022-01-27 18:08:06 +01:00
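The contract proposed in the commit above, where `bound` always holds the minimum score a candidate needs in order to be considered, can be sketched on a plain array of scores. This is a hypothetical illustration, not the `HnswGraph#search` code: `TopKWithBound` is an invented name and the candidates are just floats. The bound starts at negative infinity, so every candidate passes until `k` results are held, and from then on it tracks the worst retained score. The acceptance test is a single comparison with no separate "is the queue full" check.

```java
import java.util.PriorityQueue;

/** Hypothetical sketch of top-k selection driven by a single minimum competitive bound. */
final class TopKWithBound {
  private TopKWithBound() {}

  /** Returns up to k of the highest scores, in descending order. */
  static float[] topK(float[] scores, int k) {
    if (k <= 0) {
      return new float[0];
    }
    PriorityQueue<Float> results = new PriorityQueue<>(); // min-heap: worst kept score on top
    float bound = Float.NEGATIVE_INFINITY;                // minimum score required to compete

    for (float score : scores) {
      if (score > bound) {            // the only check needed
        results.add(score);
        if (results.size() > k) {
          results.poll();             // evict the worst kept score
        }
        if (results.size() == k) {
          bound = results.peek();     // raise the bound once k results are held
        }
      }
    }

    // Drain the min-heap from the back so the output is in descending order.
    float[] out = new float[results.size()];
    for (int i = out.length - 1; i >= 0; i--) {
      out[i] = results.poll();
    }
    return out;
  }
}
```

One consequence of the strict comparison in this sketch is that a candidate tying with the current bound is not admitted once `k` results are held.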
Greg Miller 4323848469
LUCENE-10368: Make IntTaxonomyFacets pkg-private (#600) 2022-01-27 08:56:42 -08:00
Mayya Sharipova dcd9e3d6f7
LUCENE-10389 Adjust TestHnswGraph.testRandom (#626)
Before PR #608, this test searched the HnswGraph with
numSeed (the search queue size) equal to 100.
This patch restores the search queue size to its original value of 100
and takes the top topK results from it.
2022-01-27 09:06:48 -05:00
gf2121 eda9c29b8c
LUCENE-10388: Remove MultiLevelSkipListReader#SkipBuffer (#620) 2022-01-26 16:36:33 +08:00
gf2121 3ad4c1b3c9
clean lastPayloadByteUpto (#619) 2022-01-26 14:03:32 +08:00
Tomoko Uchida e18dfea8bd LUCENE-10389: temporarily disable TestHnswGraph.testRandom() 2022-01-26 10:45:55 +09:00
Mayya Sharipova b0d6fe68d1
LUCENE-10054 Make HnswGraph hierarchical (#608)
Currently HNSW has only a single layer.
This patch makes HNSW graph multi-layered.

This PR is based on the following PRs:
 #250, #267, #287, #315, #536, #416

Main changes:
- Multi layers are introduced into HnswGraph and HnswGraphBuilder
- A new Lucene91HnswVectorsFormat with new Lucene91HnswVectorsReader
and Lucene91HnswVectorsWriter are introduced to encode graph
layers' information
- Lucene90Codec, Lucene90HnswVectorsFormat, and the reading logic of
Lucene90HnswVectorsReader and Lucene90HnswGraph are moved to
backward_codecs to support reading and searching of graphs built
in pre 9.1 version. Lucene90HnswVectorsWriter is deleted.
- For backwards compatible tests, previous Lucene90 graph reading and
writing logic was copied into test files of
Lucene90RWHnswVectorsFormat, Lucene90HnswVectorsWriter,
Lucene90HnswGraphBuilder and Lucene90HnswRWGraph.

TODO: tests for KNN search for graphs built in pre 9.1 version;
tests for merge of indices of pre 9.1 + current versions.
2022-01-25 13:53:55 -05:00
Mayya Sharipova 1a4f838fe2
LUCENE-10384: Simplify LongHeap small addition (#623)
LUCENE-10384 and PR#615 introduced encoding f into NeighborQueue.
But one function, `nodes()`, had not yet been updated to handle this encoding.

Also modify the test that would fail without this patch.
2022-01-25 11:43:40 -05:00
Luca Cavanna 11006fba59
LUCENE-10002: Replace simple usages of TotalHitCountCollector with IndexSearcher#count (#612)
When only the number of matching documents is needed, IndexSearcher#search(Query, Collector) is commonly used, which does not use the executor that may have been set on the searcher. Calling `IndexSearcher#count(Query)` makes the code more concise and is also more correct, as it honours the executor that has been set on the searcher instance.

Co-authored-by: Adrien Grand <jpountz@gmail.com>
2022-01-25 16:11:19 +01:00
Tomoko Uchida fd817b6fb1 LUCENE-8930: increase timeout to 1 minute for the launched luke (seems it occasionally takes a long time on windows vm) 2022-01-25 22:56:11 +09:00
Tomoko Uchida 0aa526d256 LUCENE-10076: fix the assertion in luke module to only check if the optional has a value 2022-01-25 22:05:27 +09:00
Adrien Grand 07fe46ff86
LUCENE-10384: Simplify LongHeap. (#615)
The min/max ordering logic moves to NeighborQueue.
2022-01-25 09:04:52 +01:00
Greg Miller eaf3cb6739 Fix minor bug that snuck in with LUCENE-9952 2022-01-24 06:58:31 -08:00
Greg Miller 9e560c1af1
LUCENE-9952: Fix dim count inaccuracies in SSDV faceting when a dim is multi-valued (#611) 2022-01-24 06:48:20 -08:00
Greg Miller 10ca531ddc
LUCENE-10381: Require users to provide FacetsConfig for SSDV faceting (#613) 2022-01-24 06:46:22 -08:00
Julie Tibshirani fb09ae1f7c Undo accidental change to build.gradle 2022-01-23 16:26:16 -08:00
Julie Tibshirani 7ece8145bc
LUCENE-10375: Write vectors to file in flush (#617)
In a previous commit, we updated HNSW merge to first write the combined segment
vectors to a file, then use that file to build the graph. This commit applies
the same strategy to flush, which lets us use the same logic for flush and
merge.
2022-01-23 16:19:23 -08:00
Dawid Weiss 08d6633d94 LUCENE-8930: increase timeout for the launched luke. 2022-01-20 16:51:05 +01:00