13930 Commits

Author SHA1 Message Date
John Mazanec 776149f0f6
Reuse HNSW graph for initialization during merge (#12050)
* Remove implicit addition of vector 0

Removes logic to add 0 vector implicitly. This is in preparation for
adding nodes from other graphs to initialize a new graph. Having the
implicit addition of node 0 complicates this logic.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Enable out of order insertion of nodes in hnsw

Enables nodes to be added to OnHeapHnswGraph out of order.
To do so, additional operations are needed to re-sort the
nodesByLevel array. Optimizations have been made to avoid sorting
whenever possible.
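
A minimal sketch of the "avoid sorting whenever possible" idea, assuming a plain int array per level. The class `LazySortedNodes` and its methods are hypothetical and are not the OnHeapHnswGraph code: appends that arrive in ascending order keep a `sorted` flag set, and a sort runs only after an out-of-order append has been seen.

```java
import java.util.Arrays;

/** Hypothetical sketch: an int list that tolerates out-of-order appends and re-sorts lazily. */
final class LazySortedNodes {
  private int[] nodes = new int[8];
  private int size = 0;
  // Stays true while every append is >= the previous one.
  private boolean sorted = true;

  void add(int node) {
    if (size == nodes.length) {
      nodes = Arrays.copyOf(nodes, size * 2);
    }
    // Only an append smaller than the current tail breaks the sorted invariant.
    if (size > 0 && node < nodes[size - 1]) {
      sorted = false;
    }
    nodes[size++] = node;
  }

  /** Returns the nodes in ascending order, sorting only if an out-of-order add occurred. */
  int[] sortedNodes() {
    if (!sorted) {
      Arrays.sort(nodes, 0, size);
      sorted = true;
    }
    return Arrays.copyOf(nodes, size);
  }
}
```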

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Add ability to initialize from graph

Adds method to initialize an HNSWGraphBuilder from another HNSWGraph.
Initialization can only happen when the builder's graph is empty.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Utilize merge with graph init in HNSWWriter

Uses HNSWGraphBuilder initialization from graph functionality in
Lucene95HnswVectorsWriter. Selects the largest graph to initialize the
new graph produced by the HNSWGraphBuilder for merge.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Minor modifications to Lucene95HnswVectorsWriter

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Use TreeMap for graph structure for levels > 0

Refactors OnHeapHnswGraph to use TreeMap to represent graph structure of
levels greater than 0. Refactors NodesIterator to support set
representation of nodes.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Refactor initializer to be in static create method

Refactors initialization from graph to be accessible via a create
static method in HnswGraphBuilder.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Address review comments

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Add change log entry

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Remove empty iterator for neighborqueue

Signed-off-by: John Mazanec <jmazane@amazon.com>

---------

Signed-off-by: John Mazanec <jmazane@amazon.com>
2023-02-07 14:42:03 -05:00
Adrien Grand ab074d5483
Introduce a new `KeywordField`. (#12054)
`KeywordField` is a combination of `StringField` and `SortedSetDocValuesField`,
similarly to how `LongField` is a combination of `LongPoint` and
`SortedNumericDocValuesField`. This makes it easier for users to create fields
that can be used for filtering, sorting and faceting.
2023-02-07 18:19:09 +01:00
Adrien Grand d69326408c Fix javadoc references. 2023-02-07 11:06:04 +01:00
Adrien Grand 8b572f074e Fix nightly compatibility tests after #12116. 2023-02-07 10:46:30 +01:00
Uwe Schindler bc09c2a0d9
Add tests for size() and contains() to LongHashSet; fix size bug with MISSING (#12134) 2023-02-07 00:52:29 +01:00
Uwe Schindler 57403e26e0
Simplify LongHashSet by completely removing java.util.Set APIs (#12133) 2023-02-06 22:43:20 +01:00
Uwe Schindler 8564da434d
Generate gradle.properties from gradlew (#12131)
* SOLR-16641 - Generate gradle.properties from gradlew (#1320)
* Adapt for Lucene
* Remove localSettings from smoker; thanks @colvinco
* Print properties at end for debugging
* Add CHANGES.txt entry

---------

Co-authored-by: Colvin Cowie <colvin.cowie.dev@gmail.com>
Co-authored-by: Colvin Cowie <51863265+colvinco@users.noreply.github.com>
2023-02-06 19:47:15 +01:00
Robert Muir 02b202866c
Add CHANGES.txt for #12129 2023-02-06 12:49:49 -05:00
Robert Muir 0bc4135695
Speedup sandbox/DocValuesTermsQuery (#12129)
* Optimize the common case that docs only have single values for the field
* In the multivalued case, terminate reading docvalues if they are > maximum set ordinal
* Implement ScorerSupplier, so that (potentially large) number of ordinal lookups aren't performed just to get the cost()
* Graduate to Sorted(Set)DocValuesField.newSlowSetQuery to complement newSlowRangeQuery, newSlowExactQuery

Like other slow queries in these classes, it's currently only recommended to use with points, e.g. IndexOrDocValuesQuery(new PointInSetQuery, newSlowSetQuery)
2023-02-06 12:47:53 -05:00
Robert Muir 10d9c7440b
Speed up docvalues set query by making use of sortedness (#12128)
LongHashSet is used for the set of numbers, but it has some issues:
* tries too hard to extend AbstractSet, mostly for testing
* causes traps with boxing if you aren't careful
* complex hashcode/equals

Practically we should take advantage of the fact numbers come in sorted
order for multivalued fields: just like range queries do. So we use
min/max to our advantage, including termination of docvalues iteration

Actually it is generally a win to just check min/max even in the single-valued
case: these constant time comparisons are cheap and can avoid hashing,
etc.

In the worst-case, if all of your query Sets contain both the minimum and maximum
possible values, then it won't help, but it doesn't hurt either.
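
The min/max idea described above can be sketched as follows. This is an illustration under stated assumptions, not the Lucene implementation: `MinMaxSetMatcher` is a hypothetical class, and it uses a boxed `HashSet<Long>` for brevity where Lucene uses its own `LongHashSet`.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: set membership with cheap min/max checks before hashing. */
final class MinMaxSetMatcher {
  private final Set<Long> values = new HashSet<>();
  private final long min;
  private final long max;

  MinMaxSetMatcher(long... queryValues) {
    if (queryValues.length == 0) {
      throw new IllegalArgumentException("at least one value is required");
    }
    long lo = Long.MAX_VALUE;
    long hi = Long.MIN_VALUE;
    for (long v : queryValues) {
      values.add(v);
      lo = Math.min(lo, v);
      hi = Math.max(hi, v);
    }
    this.min = lo;
    this.max = hi;
  }

  /** Single-valued case: two constant-time comparisons can avoid the hash lookup. */
  boolean matchesSingle(long v) {
    return v >= min && v <= max && values.contains(v);
  }

  /** Multi-valued case: docValues must be sorted ascending, as doc values are. */
  boolean matchesSorted(long[] docValues) {
    for (long v : docValues) {
      if (v < min) {
        continue; // too small to be in the set, skip without hashing
      }
      if (v > max) {
        return false; // all remaining values are larger still, stop iterating
      }
      if (values.contains(v)) {
        return true;
      }
    }
    return false;
  }
}
```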
2023-02-06 12:14:02 -05:00
Robert Muir a6bceb7cf0
Remove useless abstractions in DocValues-based queries (#12127)
There's no need to make things abstract: DocValues does the right thing.
Optimizing for the case where no docs in the segment have the field is easy: a simple null check (replacing the existing one!).
2023-02-06 11:49:25 -05:00
Adrien Grand 684f31ef06 Fix ambiguous links. (#12116) 2023-02-06 17:45:05 +01:00
Adrien Grand 96136b4282
Improve document API for stored fields. (#12116)
Currently stored fields have to look at binaryValue(), stringValue() and
numericValue() to guess the type of the value and then store it. This has a few
issues:
 - If there is a problem, e.g. all of these 3 methods return null, it's
   currently discovered late, when we already passed the responsibility of
   writing data from IndexingChain to the codec.
 - numericValue() is used both for numeric doc values and storage. This makes
   it impossible to implement a `double` field that is stored and doc-valued,
   as numericValue() needs to return simultaneously a number that consists of
   the double for storage, and the long bits of the double for doc values.
 - binaryValue() is used both for sorted(_set) doc values and storage. This
   makes it impossible to implement a `keyword` field that is stored and
   doc-valued, as the field returns a non-null value for both binaryValue() and
   stringValue() and stored fields no longer know which value to store.

This commit introduces `IndexableField#storedValue()`, which is used only for
stored fields. This addresses the above issues. IndexingChain passes the
storedValue() directly to the codec, so it's impossible for a stored fields
format to mistakenly use binaryValue()/stringValue()/numericValue() instead of
storedValue().
2023-02-06 17:08:13 +01:00
Benjamin Trent c2bef381d1
Fix TestFeatureField#testBasicsNonScoringCase test (#12130)
Sometimes the randomized Lucene test searcher wraps the reader. Consequently, we need to make sure to use the reader provided by the test IndexSearcher, or the reader may differ between creating the weight with the searcher and accessing the leaf context for the scorer.
2023-02-06 10:15:00 -05:00
Benjamin Trent 4bbc273a43
Add `FeatureQuery` weight caching in non-scoring case (#12118)
While FeatureQuery is a powerful tool in the scoring case, there are scenarios when caching should be allowed and scoring disabled.

A particular case is when the FeatureQuery is used in conjunction with learned-sparse retrieval. It is useful to iterate and calculate the entire matching doc set when combined with various other queries.

related to: https://github.com/apache/lucene/issues/11799
2023-02-02 13:00:33 -05:00
Ioana Tagirta d591c9c37a
Always generate a polygon that has no self intersections (#12124) 2023-02-02 16:57:42 +01:00
Jean-François BOEUF 5acca82633
Reduce bloom filter size by using the optimal count for hash functions. (#11900) 2023-02-01 14:35:50 +01:00
Marc D'Mello f9cb6a3b42
GITHUB-11868: Add FilterIndexInput and FilterIndexOutput wrapper classes (#11958)
Co-authored-by: Marc D'Mello <dmellomd@amazon.com>
2023-02-01 14:11:12 +01:00
Luca Cavanna 57397f0cab
Adjust return type for VectorUtil methods (#12122)
Two of the methods (squareDistance and dotProduct) that take byte arrays return a float while
the variable used to store the value is an int. They can just return an int.
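
As a rough illustration of why an `int` return type is enough here, the sketch below accumulates byte products in an `int`. `ByteVectorMath` is a hypothetical class written for this example, not the actual VectorUtil code.

```java
/** Hypothetical sketch: byte-vector math that accumulates and returns an int. */
final class ByteVectorMath {
  static int dotProduct(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector dimensions differ");
    }
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      total += a[i] * b[i]; // bytes are promoted to int before multiplying
    }
    return total;
  }

  static int squareDistance(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector dimensions differ");
    }
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      int diff = a[i] - b[i];
      total += diff * diff;
    }
    return total;
  }
}
```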
2023-02-01 10:48:05 +01:00
Luca Cavanna ce433e5449
Remove VectorUtil#toBytesRef (#12121)
The method is currently only used in its corresponding test method.
2023-02-01 10:47:50 +01:00
Luca Cavanna 73e2ae2705 Add back-compat indices for 9.5.0 2023-01-30 20:50:01 +01:00
Luca Cavanna 2bd87b7909 Fix formatting of Version.java 2023-01-25 13:49:23 +01:00
Luca Cavanna 59da15b0e5 Add minor version 9.6.0 2023-01-25 13:42:38 +01:00
Luca Cavanna 72c8b334a0 Add missing LUCENE_9_5_0 version 2023-01-25 13:37:48 +01:00
Benjamin Trent d1fa52e62f
Fix flaky TestHnswByteVectorGraph.testSortedAndUnsortedIndicesReturnSameResults test (#12110) 2023-01-25 10:47:20 +01:00
Luca Cavanna 5a51ce1d5d
SimpleText codec to support writing byte vectors (#12111)
A recent test failure signaled that when the simple text codec was randomly selected, byte vectors could not be written.
This commit addressed that by adding support for writing byte vectors to SimpleTextKnnVectorsWriter.

Note that while support is added to the BufferingKnnVectorsWriter base class, 90, 91 and 92 writers don't need to support
byte vectors and will throw an UnsupportedOperationException when attempting to write them.
2023-01-25 10:43:35 +01:00
Luca Cavanna 95e2cfcc1e
Remove deprecated float vector classes and methods (#12107)
Follow-up of #12105 to remove the deprecated classes for the next major version.

Removes KnnVectorField, KnnVectorQuery, VectorValues and LeafReader#getVectorValues.
2023-01-24 16:25:36 +01:00
Adrien Grand ce8eaf138c
MemoryIndex should not fail integer fields that enable doc values. (#12109)
When a field indexes numeric doc values, `MemoryIndex` does an unchecked cast
to `java.lang.Long`. However, the new `IntField` represents the value as a
`java.lang.Integer` so this cast fails. This commit aligns `MemoryIndex` with
`IndexingChain` by casting to `Number` and calling `Number#longValue` instead
of casting to `Long`.
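
The difference between the two casts can be shown with plain JDK types. `NumericValueCoercion` is a hypothetical helper used only for this illustration:

```java
/** Hypothetical sketch: widen any boxed numeric value to long without assuming its class. */
final class NumericValueCoercion {
  static long toLong(Object value) {
    // A direct cast to Long throws ClassCastException when value is an Integer;
    // casting to Number and calling longValue() works for both.
    return ((Number) value).longValue();
  }
}
```

With this helper, `toLong(Integer.valueOf(42))` and `toLong(Long.valueOf(42L))` both yield the same `long`.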
2023-01-24 11:51:49 +01:00
Luca Cavanna 25623a63bf format changelog entry and add missing author name 2023-01-24 10:57:52 +01:00
Luca Cavanna 92e67ec626
Rename vector float classes and methods (#12105)
We recently introduced KnnByteVectorField, KnnByteVectorQuery and ByteVectorValues. The corresponding float variants of the same classes don't follow the same naming convention: KnnVectorField, KnnVectorQuery and VectorValues. Ideally their names would reflect that they are the float variant of the vector field, vector query and vector values.

This commit aims at clarifying this in the public facing API, by deprecating the current float classes in favour of new ones that are their exact copy but follow the same naming conventions as the byte ones.

As a result, LeafReader#getVectorValues is also deprecated in favour of newly introduced getFloatVectorValues method that returns FloatVectorValues.

Relates to #11963
2023-01-24 10:20:24 +01:00
Luca Cavanna f5bd28662f Remove BytesRef usage from SortingCodecReader
Follow-up of #12102
2023-01-23 22:29:10 +01:00
Luca Cavanna 4594400216
Replace BytesRef usages in byte vectors API with byte[] (#12102)
The main classes involved are ByteVectorValues, KnnByteVectorField and KnnByteVectorQuery. It becomes quite natural to simplify things further and use byte[] in the following methods too: ByteVectorValues#vectorValue, KnnVectorReader#search, LeafReader#searchNearestVectors, HNSWGraphSearcher#search, VectorSimilarityFunction#compare, VectorUtil#cosine, VectorUtil#squareDistance, VectorUtil#dotProduct, VectorUtil#dotProductScore
2023-01-23 22:06:00 +01:00
Luca Cavanna f8ee852696
add missing changelog for #12064 2023-01-23 21:22:15 +01:00
Adrien Grand de94fa97fb
Remove binaryValue() on VectorValues and ByteVectorValues. (#12101)
This method tries to expose an encoded view of vectors, but this shouldn't be
part of our user-facing API. With this change, the way vectors are encoded
is entirely up to the codec.
2023-01-23 14:42:49 +01:00
Alessandro Benedetti 77eca4bb38
Introduce getters for KnnVectorQuery (#12029) 2023-01-23 12:35:08 +01:00
Chris Hostetter 9007f746a3 WordBreakSpellChecker now correctly respects maxEvaluations (#12077) 2023-01-22 15:44:29 -07:00
Vigya Sharma 519adcc954
Fix failure in TestIndexSortSortedNumericDocValuesRangeQuery.testCountBoundary (#12098) 2023-01-19 16:16:42 -08:00
Lu Xugang a9fd21b6af
Same bound with fallbackQuery (#12084)
IndexSortSortedNumericDocValuesRangeQuery should have the same bound with fallbackQuery.
2023-01-19 14:33:58 +08:00
Vigya Sharma dc33ade76d
Remove UTF8TaxonomyWriterCache (#12092)
Removes the never-evicting UTF8TaxonomyWriterCache, changing the default to LruTaxonomyWriterCache
2023-01-18 13:20:26 -08:00
twosom 318b002e0b
fix typo in KoreanNumberFilter (#12045)
* fix typo in KoreanNumberFilter

* fix doc format
2023-01-17 22:32:59 -08:00
Robert Muir 4fe8424925
Graduate DocValuesNumbersQuery from lucene/sandbox to newSlowSetQuery() (#12087)
Clean up this query a bit and support:
* NumericDocValuesField.newSlowSetQuery()
* SortedNumericDocValuesField.newSlowSetQuery()

This complements the existing docvalues-based range queries, with a set query.

Add ScorerSupplier/cost estimation support to PointInSetQuery
Add newSetQuery() to IntField/LongField/DoubleField/FloatField, that uses IndexOrDocValuesQuery
2023-01-16 09:38:08 -05:00
Alan Woodward fc7b937aff
Don't throw UOE when highlighting FieldExistsQuery (#12088)
WeightedSpanTermExtractor will try to rewrite queries that it doesn't
know about, to see if they end up as something it does know about and
that it can extract terms from. To support field merging, it rewrites against
a delegating leaf reader that does not support getFieldInfos().

FieldExistsQuery uses getFieldInfos() in its rewrite, which means that
if one is passed to WeightedSpanTermExtractor, we get an
UnsupportedOperationException thrown.

This commit makes WeightedSpanTermExtractor aware of FieldExistsQuery,
so that it can just ignore it and avoid throwing an exception.
2023-01-16 11:47:51 +00:00
Robert Muir e06f8c2e8b
Update to error-prone 2.17 (#12056) 2023-01-14 11:38:39 -05:00
Robert Muir 8ca05967e8
remove non-NRT replication support (#12038)
Remove non-NRT replication support in 10.x (to be deprecated in 9.5)
2023-01-14 11:14:46 -05:00
Adrien Grand b5062a2858
MultiCollector shouldn't report that scores are needed when they're not. (#12083)
When sub collectors don't agree on their `ScoreMode`, `MultiCollector`
currently returns `COMPLETE`. This makes sense when assuming that there is
likely one collector computing top hits (`TOP_SCORES`) and another one
computing facets (`COMPLETE_NO_SCORES`) so `COMPLETE` makes sense. However it
is also possible to have one collector computing top hits by field (`TOP_DOCS`)
and another one doing facets (`COMPLETE_NO_SCORES`), and `MultiCollector`
shouldn't report that scores are needed in that case.
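
One way to read the rule described above is sketched below. The `Mode` enum is a reduced, hypothetical stand-in for Lucene's `ScoreMode` that contains only the four modes named in this message, and `combine` is an assumption about the intended behavior rather than the actual `MultiCollector` code.

```java
import java.util.List;

/** Hypothetical sketch: combine the score modes of several sub collectors. */
final class ScoreModeCombiner {
  enum Mode {
    COMPLETE(true),
    COMPLETE_NO_SCORES(false),
    TOP_SCORES(true),
    TOP_DOCS(false);

    final boolean needsScores;

    Mode(boolean needsScores) {
      this.needsScores = needsScores;
    }
  }

  static Mode combine(List<Mode> modes) {
    if (modes.isEmpty()) {
      throw new IllegalArgumentException("at least one collector is required");
    }
    Mode first = modes.get(0);
    boolean allSame = true;
    boolean anyNeedsScores = false;
    for (Mode m : modes) {
      allSame &= (m == first);
      anyNeedsScores |= m.needsScores;
    }
    if (allSame) {
      return first; // collectors agree, keep their mode
    }
    // Collectors disagree: visit all hits, but only ask for scores if someone uses them.
    return anyNeedsScores ? Mode.COMPLETE : Mode.COMPLETE_NO_SCORES;
  }
}
```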
2023-01-13 14:44:17 +01:00
Luca Cavanna 90a5a71448 fix typo in changelog 2023-01-13 14:12:06 +01:00
Lu Xugang 102622842b
Enhance XXXField#newRangeQuery (#12078)
Introduce IndexSortSortedNumericDocValuesRangeQuery to IntField#newRangeQuery and LongField#newRangeQuery
2023-01-13 18:12:26 +08:00
Adrien Grand aaab028266
Speed up DocIdMerger on sorted indexes. (#12081)
In the case when an index is sorted on a low-cardinality field, or the index
sort order correlates with the order in which documents get ingested, we can
optimize `SortedDocIDMerger` by doing a single comparison with the doc ID on
the next sub. This check covers at the same time whether the priority queue
needs reordering and whether the current sub reached `NO_MORE_DOCS`.
2023-01-12 18:27:45 +01:00
Adrien Grand 729fedcbac
Speed up 1D BKD merging. (#12079)
On the NYC taxis dataset on my local machine, switching from
`Arrays#compareUnsigned` to `ArrayUtil#getUnsignedComparator` yielded a 15%
speedup of BKD merging.
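
To illustrate why a comparator chosen for a fixed width can beat a general-purpose range comparison, here is a minimal sketch for 8-byte keys. `UnsignedCompare8` is hypothetical and is not the `ArrayUtil` implementation: it reads both keys as big-endian longs and compares them as unsigned values, which gives the same ordering as a byte-by-byte unsigned comparison.

```java
/** Hypothetical sketch: unsigned comparison specialized for 8-byte keys. */
final class UnsignedCompare8 {
  /** Reads 8 bytes starting at off as a big-endian long. */
  private static long readLongBE(byte[] b, int off) {
    long v = 0;
    for (int i = 0; i < 8; i++) {
      v = (v << 8) | (b[off + i] & 0xFFL);
    }
    return v;
  }

  /** Orders the 8-byte keys at a[aOff..] and b[bOff..] as unsigned byte sequences. */
  static int compare(byte[] a, int aOff, byte[] b, int bOff) {
    return Long.compareUnsigned(readLongBE(a, aOff), readLongBE(b, bOff));
  }
}
```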
2023-01-12 18:14:15 +01:00
Benjamin Trent 59b17452aa
Fix exponential runtime for Boolean#rewrite (#12072)
When #672 was introduced, it added many nice rewrite optimizations. However, when there are many nested Boolean queries under a top-level Boolean#filter clause, its runtime grows exponentially.

The key issue was how BooleanQuery#rewriteNoScoring redirected yet again to ConstantScoreQuery#rewrite. This causes BooleanQuery#rewrite to be called again recursively, even though it was previously called in ConstantScoreQuery#rewrite, and THEN BooleanQuery#rewriteNoScoring is called again, recursively.

This causes exponential growth in rewrite time based on query depth. The change here hopes to short-circuit that and only grow (near) linearly by calling BooleanQuery#rewriteNoScoring directly, instead of attempting to redirect through ConstantScoreQuery#rewrite.
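
The growth pattern can be seen in a toy model. This is not BooleanQuery code: `RewriteDepthDemo` is hypothetical and only counts calls, assuming that the redundant path re-enters the nested rewrite twice per level while the direct path enters it once.

```java
/** Hypothetical toy model: count rewrite calls for a query nested `depth` levels deep. */
final class RewriteDepthDemo {
  private static long calls;

  /** Each level re-enters the nested rewrite twice, mimicking the redundant redirect. */
  private static void rewriteRedundant(int depth) {
    calls++;
    if (depth == 0) {
      return;
    }
    rewriteRedundant(depth - 1); // first pass
    rewriteRedundant(depth - 1); // second pass caused by redirecting again
  }

  /** Each level enters the nested rewrite once. */
  private static void rewriteDirect(int depth) {
    calls++;
    if (depth == 0) {
      return;
    }
    rewriteDirect(depth - 1);
  }

  static long count(boolean redundant, int depth) {
    calls = 0;
    if (redundant) {
      rewriteRedundant(depth);
    } else {
      rewriteDirect(depth);
    }
    return calls;
  }
}
```

In this model the redundant path makes 2^(depth+1) - 1 calls, while the direct path makes depth + 1.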

closes: #12069
2023-01-12 16:50:05 +01:00
Adrien Grand 9ab324d2be Allow reusing indexed binary fields. (#12053)
Today Lucene allows creating indexed binary fields, e.g. via
`StringField(String, BytesRef, Field.Store)`, but not reusing them: calling
`setBytesValue` on a `StringField` throws.

This commit removes the check that prevents reusing fields with binary values.
I considered an alternative that consisted of failing if calling
`setBytesValue` on a field that is indexed and tokenized, but we currently
don't have such checks e.g. on numeric values, so it did not feel consistent.

Doing this change would help improve the [nightly benchmarks for the NYC taxis
dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html)
by doing the String -> UTF-8 conversion only once for keywords, instead of once
for the `StringField` and once for the `SortedDocValuesField`, while still
reusing fields.
2023-01-12 09:52:12 +01:00
Adrien Grand 525a11091c Revert "Allow reusing indexed binary fields. (#12053)"
This reverts commit 84778549af.
2023-01-12 09:48:06 +01:00
Adrien Grand 84778549af
Allow reusing indexed binary fields. (#12053)
Today Lucene allows creating indexed binary fields, e.g. via
`StringField(String, BytesRef, Field.Store)`, but not reusing them: calling
`setBytesValue` on a `StringField` throws.

This commit removes the check that prevents reusing fields with binary values.
I considered an alternative that consisted of failing if calling
`setBytesValue` on a field that is indexed and tokenized, but we currently
don't have such checks e.g. on numeric values, so it did not feel consistent.

Doing this change would help improve the [nightly benchmarks for the NYC taxis
dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html)
by doing the String -> UTF-8 conversion only once for keywords, instead of once
for the `StringField` and once for the `SortedDocValuesField`, while still
reusing fields.
2023-01-12 09:32:13 +01:00
Patrick Zhai d3d9ab0044
Drop wrong assertion in TestBooleanQuery.testQueryMatchesCount (#12051) 2023-01-11 10:44:06 -08:00
Adrien Grand a8ef03d979
Never throttle creation of compound files. (#12070)
`ConcurrentMergeScheduler` uses the rate at which a merge writes bytes as a
proxy for CPU usage, in order to prevent merging from disrupting searches too
much. However, creating compound files is lightweight CPU-wise and does not need
throttling.

Closes #12068
2023-01-11 09:57:13 +01:00
Adrien Grand 56ec51e558
Cut over Lucene Demo from LongPoint to LongField. (#12052) 2023-01-11 09:43:43 +01:00
Benjamin Trent cc29102a24
Create new KnnByteVectorField and KnnVectorsReader#getByteVectorValues(String) (#12064) 2023-01-11 09:20:47 +01:00
Erik Pellizzon e14327288e
Documenting that IndexReaderContext#leaves() will never return a null value and remove the null checks from the method calls (#12034) 2023-01-06 14:35:52 -08:00
Uwe Schindler 5fccaec166 Remove deprecated APIs after #12066; this also removes another one missed to be removed before 2023-01-05 11:53:16 +01:00
Uwe Schindler 7f483bd618
Retire/deprecate per-instance MMapDirectory#setUseUnmap (#12066) 2023-01-04 19:17:03 +01:00
Uwe Schindler 2f602c01dd
Add a sysprop "org.apache.lucene.store.MMapDirectory.enableMemorySegments" (#12062) 2023-01-03 19:10:28 +01:00
Lu Xugang 19cc6cdf66
Out of boundary in CombinedFieldQuery#addTerm (#12046) 2023-01-03 15:33:36 +08:00
Uwe Schindler e2ee09d0c5
Fix detection of Hotspot in TestRamUsageEstimator so it works with OpenJ9 that has the bean, but without properties (#12058) 2023-01-02 00:05:36 +01:00
twosom 4676a735c1
fix typo analysis-kuromoji (#12047) 2023-01-01 10:58:50 -05:00
Patrick Zhai 4eab1d74e8
Fix TestRangeOnRangeFacetCounts dimension overflow error (#12049) 2022-12-30 13:18:51 -08:00
Greg Miller 21107d811b Add CHANGES entry for GITHUB#11869 2022-12-30 07:46:51 -08:00
Marc D'Mello cbfed77fd3
Github#11869: Add RangeOnRangeFacetCounts (#11901) 2022-12-30 07:38:13 -08:00
Adrien Grand 6f477e5831
Optimize flush of doc-value fields that are effectively single-valued when an index sort is configured. (#12037)
This iterates on #399 to also optimize the case when an index sort is
configured. When cutting over the NYC taxis benchmark to the new numeric
fields,
[flush times](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html#flush_times)
stayed mostly the same when index sorting is disabled and increased by 7-8%
when index sorting is enabled. I expect this change to address this slowdown.
2022-12-27 11:12:56 +01:00
Adrien Grand ddd63d2da3
Tune the amount of memory that is allocated to sorting postings upon flushing. (#12011)
When flushing segments that have an index sort configured, postings lists get
loaded into arrays and get reordered according to the index sort.

This reordering is implemented with `TimSorter`, a variant of merge sort. Like
merge sort, an important part of `TimSorter` consists of merging two contiguous
sorted slices of the array into a combined sorted slice. This merging can be
done either with external memory, which is the classical approach, or in-place,
which still runs in linear time but with a much higher factor. Until now we
were allocating a fixed budget of `maxDoc/64` for doing these merges with
external memory. If this is not enough, sorted slices would be merged in place.

I've been looking at some profiles recently for an index where a non-negligible
chunk of the time was spent on in-place merges. So I would like to propose the
following change:
 - Increase the maximum RAM budget to `maxDoc / 8`. This should help avoid
   in-place merges for all postings up to `docFreq = maxDoc / 4`.
 - Make this RAM budget lazily allocated, rather than eagerly like today. This
   would help not allocate memory in O(maxDoc) for fields like primary keys
   that only have a couple postings per term.

So overall memory usage would never be more than 50% higher than what it is
today, because `TimSorter` never needs more than X temporary slots if the
postings list doesn't have at least 2*X entries, and these 2*X entries already
get loaded into memory today. And for fields that have short postings, memory
usage should actually be lower.
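
A minimal sketch of a lazily allocated, capped temporary buffer, assuming the `maxDoc / 8` cap proposed above. `LazyMergeBuffer` is a hypothetical class for illustration and does not mirror the `TimSorter` internals: nothing is allocated until a merge asks for slots, and requests above the cap signal that the caller should merge in place.

```java
/** Hypothetical sketch: a temp buffer that is allocated on demand and capped at maxDoc / 8. */
final class LazyMergeBuffer {
  private final int maxSlots;
  private int[] tmp = new int[0]; // nothing allocated up front

  LazyMergeBuffer(int maxDoc) {
    this.maxSlots = maxDoc / 8;
  }

  /**
   * Returns a buffer with at least {@code needed} slots, or null when the request
   * exceeds the cap and the caller should fall back to an in-place merge.
   */
  int[] acquire(int needed) {
    if (needed > maxSlots) {
      return null;
    }
    if (tmp.length < needed) {
      tmp = new int[needed]; // grow only as far as a merge actually requires
    }
    return tmp;
  }

  int allocatedSlots() {
    return tmp.length;
  }
}
```

A field with short postings lists never requests more than a few slots, so it never pays for an O(maxDoc) allocation.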
2022-12-27 11:11:18 +01:00
Adrien Grand e9dc4f9188
Avoid sorting values of multi-valued writers if there is a single value. (#12039)
They currently call `Arrays#sort`, which incurs a tiny bit of overhead due to
range checks and some logic to determine the optimal sorting algorithm to use
depending on the number of values. We can skip this overhead in the case when
there is a single value.
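
The guard amounts to a single length check before sorting. `ValueSorter` below is a hypothetical helper, shown only to make the shortcut concrete:

```java
import java.util.Arrays;

/** Hypothetical sketch: skip the sort call when there is at most one value. */
final class ValueSorter {
  /** Sorts the first {@code count} entries of {@code values} in place. */
  static void sortIfNeeded(long[] values, int count) {
    if (count > 1) {
      Arrays.sort(values, 0, count);
    }
    // With zero or one value the slice is already sorted, so nothing is done.
  }
}
```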
2022-12-27 11:03:06 +01:00
Zach Chen 008a0d4206
Remove IOContext from Directory#openChecksumInput (#12027) 2022-12-26 11:45:42 -08:00
Uwe Schindler c9401bf064
Patch class files for Java 19 code to no longer have the "preview" flag (this enables Java 19 memory segments by default) (#12033) 2022-12-26 10:07:44 +01:00
Uwe Schindler 92f08aff9f Make childLog final to fix compilation on Java 20. This closes #12041 2022-12-25 14:55:33 +01:00
Lu Xugang 3bc8cd5c20
Aggressive `count` in BooleanWeight (#12017) 2022-12-22 23:48:05 +08:00
twosom ad22fb2879
Fix typo in AbstractQueryConfig javadocs (#12031) 2022-12-22 13:57:29 +01:00
twosom 5c78e04a17
fix typo in BaseSynonymParserTestCase (#12030)
Co-authored-by: hope <hope@gravylab.co.kr>
2022-12-21 13:28:52 -05:00
Egor Potemkin d18e3f1d45
Issue #11582 Update Faceting user guide (#12025)
Update faceting user guide to modern times.

Co-authored-by: Egor Potemkin <epotyom@amazon.com>
2022-12-21 12:20:18 -05:00
Francisco Fernández Castaño 57201aa967
Add IntField, LongField, FloatField and DoubleField (#11997)
This commit adds new IndexableFields that index both points and doc
values at once.

Closes #11199
2022-12-20 18:19:46 +01:00
Benjamin Trent 1412e559d9
Clean up KNN related backward-codecs changes (#12019) 2022-12-20 14:04:42 +01:00
Andriy Redko 945d7fe027
Upgrade ANTLR to version 4.11.1 (#12016)
Drop 3.x compatibility (which was pickier at compile-time and prevented slow things from happening). Instead add paranoia to runtime tests, so that they fail if antlr would do something slow in the parsing. This is needed because antlrv4 is a big performance trap: https://github.com/antlr/antlr4/blob/master/doc/faq/general.md

"Q: What are the main design decisions in ANTLR4?
Ease-of-use over performance. I will worry about performance later."

It allows us to move forward with newer antlr but hopefully prevent the associated headaches.

Signed-off-by: Andriy Redko <andriy.redko@aiven.io>
Co-authored-by: Robert Muir <rmuir@apache.org>
2022-12-15 22:40:35 -05:00
Craig Taverner 3e8ef57e3f
Fix flat polygons incorrectly containing intersecting geometries (#12022) 2022-12-15 14:56:09 +01:00
Benjamin Trent 11f2bc2056
Fix SimpleTextKnnVectorsReader to handle changes introduced in GITHUB#12004 (#12024) 2022-12-15 14:49:47 +01:00
Benjamin Trent 72968d30ba
Move byte vector queries into new KnnByteVectorQuery (#12004) 2022-12-14 09:53:10 +01:00
Robert Muir 9eeab8c4a6
Remove deprecated API in 10.x (#11998) 2022-12-13 10:32:15 -05:00
Robert Muir 47f8c1baa2
Migrate away from per-segment-per-threadlocals on SegmentReader (#11998)
Add new stored fields and termvectors interfaces: IndexReader.storedFields()
and IndexReader.termVectors(). Deprecate IndexReader.document() and IndexReader.getTermVector().
The new APIs do not rely upon ThreadLocal storage for each index segment, which can greatly
reduce RAM requirements when there are many threads and/or segments.

Co-authored-by: Adrien Grand <jpountz@gmail.com>
2022-12-13 09:10:21 -05:00
Ignacio Vera ef5766aa81
Fix algorithm that chooses the bridge between a polygon and a hole (#11988) 2022-12-13 10:16:53 +01:00
Robert Muir 06f9179295
Enable LongDoubleConversion error-prone check (#12010) 2022-12-12 20:55:39 -05:00
Greg Miller e34234ca6c
Remove unnecessary NaN checks from LongRange#verifyAndEncode (#12008) 2022-12-11 12:55:21 -08:00
Greg Miller 8671e29929
Some minor code cleanup in IndexSortSortedNumericDocValuesRangeQuery (#12003)
* Leverage DISI static factory methods more over custom DISI impl where possible.
* Assert points field is a single-dim.
* Bound cost estimate by the cost of the doc values field (for sparse fields).
2022-12-10 12:23:31 -08:00
gf2121 54e00df7f6
Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries (#12006) 2022-12-11 02:30:17 +08:00
gf2121 9ff989ec00
Use ByteArrayComparator to replace Arrays#compareUnsigned in some other places (#11880) 2022-12-08 23:51:08 +08:00
Alan Woodward 66127f6e69
Add support for stored fields to MemoryIndex (#11999) 2022-12-08 09:56:24 +00:00
Adrien Grand a971120d05
Make RandomAccessVectorValues an implementation detail of HNSW implementations rather than a proper API. (#11964)
`RandomAccessVectorValues` is internally used in our HNSW implementation to
provide random access to vectors, both at index and search time. In order to
better reflect this, this change does the following:
 - `RandomAccessVectorValues` moves to `org.apache.lucene.util.hnsw`.
 - `BufferingKnnVectorsWriter` no longer has a dependency on
   `RandomAccessVectorValues` and moves to `org.apache.lucene.codecs` since
   it's more of a utility class for KNN vector file formats than an index API.
   Maybe we should think of moving it near each file format that uses it
   instead.
 - `SortingCodecReader` no longer has a dependency on
   `RandomAccessVectorValues`.

Closes #10623
2022-12-08 08:49:37 +01:00
Adrien Grand 95df7e8109
Generalize range query optimization on sorted indexes to descending sorts. (#11972)
This generalizes #687 to indexes that are sorted in descending order. The main
challenge with descending sorts is that they require being able to compute the
last doc ID that matches a value, which would ideally require walking the BKD
tree in reverse order, but the API only supports moving forward. This is worked
around by maintaining a stack of `PointTree` clones to perform the search.
2022-12-08 08:38:53 +01:00
Benjamin Trent d0be9ab57c
GITHUB-11830 Better optimize storage for vector connections (#11860) 2022-12-07 08:51:54 +01:00
Karl David Wright 108462a005 Followup work for #11883 2022-12-03 08:07:10 -05:00
Costin Leau 4eba6a1284
Add exponential growth to TimeLimitingBulkScorer (#11984)
Increase the interval between timeout checks inside TimeLimitingBulkScorer at an exponential rate.
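
A rough sketch of the idea, under assumptions: the initial window of 100 documents and the doubling factor are chosen for illustration and may differ from what TimeLimitingBulkScorer actually uses, and `GrowingIntervalScorer` is a hypothetical class. The timeout is checked once per window, and each window is larger than the previous one, so long-running scoring performs few checks.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: check a timeout between scoring windows of growing size. */
final class GrowingIntervalScorer {
  static final int INITIAL_INTERVAL = 100; // assumed starting window size

  /** Returns the number of documents scored before completion or timeout. */
  static int scoreAll(int maxDoc, BooleanSupplier timedOut) {
    int doc = 0;
    int interval = INITIAL_INTERVAL;
    while (doc < maxDoc) {
      if (timedOut.getAsBoolean()) {
        return doc; // stop early, reporting how far scoring got
      }
      // Stand-in for scoring the window [doc, end).
      int end = (int) Math.min((long) doc + interval, maxDoc);
      doc = end;
      // Grow the window so later checks happen less often.
      if (interval <= Integer.MAX_VALUE / 2) {
        interval *= 2;
      }
    }
    return doc;
  }
}
```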

Fix #11676
2022-12-02 09:20:48 -08:00
Robert Muir fad3108b27
fix wrong serialization by ShapeDocValues (#11974)
Closes #11973
2022-12-01 20:32:42 -05:00
Alan Woodward 72ff140f5a
Don't let merged passages push out lower-scoring ones (#11990)
PassageScorer uses a priority queue of size maxPassages to keep track of
which highlighted passages are worth returning to the user. Once all
passages have been collected, we go through and merge overlapping
passages together, but this reduction in the number of passages is not
compensated for by re-adding the highest-scoring passages that were pushed
out of the queue by passages which have been merged away.

This commit increases the size of the priority queue to try and account for
overlapping passages that will subsequently be merged together.
2022-12-01 12:25:29 +00:00
Luca Cavanna bd168ac2a8 Add changes entry for #11985 2022-11-30 10:13:39 +01:00
Luca Cavanna 343d888b30
ExitableTerms to override getMin and getMax (#11985)
ExitableTerms should not iterate through the terms to retrieve min and max when the wrapped implementation has the values cached (e.g. FieldsReader, OrdsFieldReader)
2022-11-30 10:06:31 +01:00
Alan Woodward 0cc6f69536
Give OffsetsRetrievalStrategy implementations public constructors (#11983)
OffsetsFromMatchIterator and OffsetsFromPositions both have package-
private constructors, which makes them difficult to use as components in a
separate highlighter implementation.
2022-11-28 16:22:46 +00:00
Karl David Wright 5c4896321d Merge branch 'GITHUB-11883' into main
Pulling in changes to address ticket 11883.
2022-11-25 16:32:02 -05:00
Karl David Wright 74e8b94796 Fix for 11883. 2022-11-25 16:17:18 -05:00
Karl David Wright 6dc6b5b0dd As part of GITHUB-11883, develop new primitive Plane constructors to build boundary planes specific for each polygon edge. 2022-11-25 14:56:38 -05:00
Greg Miller 2e83c3b40f
Fix NPE in BinaryRangeFieldRangeQuery when field does not exist or is of wrong type (#11950) 2022-11-25 11:38:41 -08:00
Robert Muir 4e93f29318
fix bad shift amounts and enable check (#11979) 2022-11-25 11:47:25 -05:00
Robert Muir 545c93a394
fix use of wrong array toString() method in test, enable check (#11978) 2022-11-25 11:47:04 -05:00
Robert Muir 4885b5f856
fix use of wrong array equals() method in test, enable check (#11977) 2022-11-25 11:46:48 -05:00
Robert Muir f4286493d1
fix variable assigned to itself in test and enable check (#11980) 2022-11-25 11:45:45 -05:00
Karl David Wright b5f94b6754 Add test that tweaks identical planes in intersections bug 2022-11-25 07:40:45 -05:00
Karl David Wright b5dd71198d Refactor, restoring isWithinSection and making sure it is properly called. 2022-11-24 02:47:06 -05:00
Shubham Chaudhary b15ace46b2
Remove QueryTimeout#isTimeoutEnabled method and move check to caller (#11954)
Co-authored-by: Shubham <cshbha@amazon.com>
2022-11-24 16:37:20 +01:00
Adrien Grand 28576eb99d Fix precommit. 2022-11-24 11:44:21 +01:00
Simon Cooper 135f3fab41
Ensure collections are properly sized on creation (#11942)
A few other optimisations along the way
2022-11-24 11:20:04 +01:00
Karl David Wright 839dfb5a2d More refactoring work, and fix a distance calculation. 2022-11-23 23:36:15 -05:00
Karl David Wright 5e4623af1f For 11965, add structural changes that would allow intersection calls to also be O(log(n)). Disabled though because test failures are the result of enabling it - work ongoing. 2022-11-23 15:07:57 -05:00
Karl David Wright 482f8251ff More work related to 11965: Improve performance of nearestDistance queries somewhat by removing unnecessary code. 2022-11-23 12:21:38 -05:00
Adrien Grand 802774641a
Enforce VectorValues.cost() is equal to size(). (#11962)
`VectorValues` have a `cost()` method that reports an approximate number of
documents that have a vector, but also a `size()` method that reports the
accurate number of vectors in the field. Since KNN vectors only support
single-valued fields we should enforce that `cost()` returns the `size()`.
2022-11-23 11:05:00 +01:00
Adrien Grand 20c1ba5d9a
Remove VectorValues#EMPTY. (#11961)
This instance is illegal as it reports a number of dimensions equal to zero.
2022-11-23 10:52:12 +01:00
Adrien Grand 8bdc59ce67 Add back-compat indices for 9.4.2 2022-11-23 10:35:06 +01:00
Dawid Weiss 30873cfcd9 Fix the boxing issue again. 2022-11-23 08:29:12 +01:00
Karl David Wright 5fec8efe4e More tidying to make lint happy 2022-11-22 22:05:55 -05:00
Karl David Wright 49c8a75917 Resolve merge conflicts 2022-11-22 21:29:06 -05:00
Karl David Wright 0593eca73d Fix problem in new Plane code 2022-11-22 21:18:11 -05:00
Karl David Wright fc7ce76851 Refactor and make hierarchical GeoStandardPath. Some tests fail and will need to be researched further. 2022-11-22 21:18:11 -05:00
Karl David Wright 1ded41ea20 Final bugs fixed, except remaining legacy issue with nearest distance in GeoDegeneratePath. 2022-11-22 21:12:44 -05:00
Karl David Wright c9c27c755a Make sure use of aggregation form is consistent throughout, and fix segment endpoint computations of nearestDistance. 2022-11-22 18:43:19 -05:00
Karl David Wright ae5179986c All tests fixed save two - distance related. 2022-11-22 16:56:34 -05:00
Adrien Grand 7b7cb396e5 Tidy. 2022-11-22 18:57:21 +01:00
Adrien Grand 750e7dba32 Add bugfix version 9.4.2 2022-11-22 18:56:30 +01:00
Karl David Wright 799421abba Fix nearestDistance for real this time 2022-11-22 12:55:18 -05:00
Peter Gromov 2ae8dd632e
hunspell: support empty dictionaries, adapt to the hunspell/C++ repo changes (#11960)
2022-11-22 18:23:45 +01:00
Mike McCandless ad04ac1bc4 tidy up 2022-11-22 08:20:30 -05:00
Mike McCandless 4bd4f8b521 remove unused imports 2022-11-22 08:18:41 -05:00
Mike McCandless acbc08fb32 expand wildcard imports 2022-11-22 08:03:54 -05:00
Mike McCandless 521c0e24f2 #10878: revert #02528c6757d10420cc7d545282b49c4322943ac7 (add some test verbosity on failure (#11935)) 2022-11-22 07:59:08 -05:00
Stefan Vodita 369a70f289
Support deletions in rearrange (#11815)
* Support deletions in rearrange 
* Store BinaryDocValues in the binary doc value selector as ByteRef
   instead of String.
2022-11-21 23:52:38 -08:00
Karl David Wright 5d341e9d8c Cleanup StandardGeoPath to get rid of unused member arrays 2022-11-21 20:36:25 -05:00
Karl David Wright ecf4396ef7 Remove a dated test 2022-11-21 20:01:30 -05:00
Karl David Wright d85c35e4d7 Fix problem in new Plane code 2022-11-21 19:15:10 -05:00
Karl David Wright 20fcd0b757 Refactor and make hierarchical GeoStandardPath. Some tests fail and will need to be researched further. 2022-11-21 18:39:50 -05:00
Alan Woodward 332679886c
Add field as a separate input to newSynonymQuery (#11941)
QueryBuilder#newSynonymQuery takes an array of TermAndBoost objects as a
parameter and uses the field of the first term in the array as its field. However,
there are cases where this array may be empty, which will result in an
ArrayIndexOutOfBoundsException.

This commit reworks QueryBuilder so that TermAndBoost contains plain
BytesRefs, and passes the field as a separate parameter. This guards against
accidental calls to newSynonymQuery with an empty list - in this case, an
empty synonym query is generated rather than an exception. It also
refactors SynonymQuery itself to hold BytesRefs rather than Terms, which
needlessly repeat the field for each entry.

Fixes #11864
2022-11-21 09:55:14 +00:00
Karl David Wright cd82a9bbdc Revert the change that relies on accurate bounds from path components. This caused randomized test failures, and fixing the bounds caused other (inexplicable) test failures. More research needed. 2022-11-20 10:32:14 -05:00
Robert Muir 8fdac2d88e
fixed boxed equality check to unbreak the build: has this code been tested???? 2022-11-19 23:11:30 -05:00
Karl David Wright 1e236090af Fix up formatting 2022-11-19 17:56:10 -05:00
Karl David Wright 9bca7a70e1 Fix longstanding bug in path bounds calculation, and hook up efficient isWithin() and distance logic 2022-11-19 17:56:09 -05:00
Karl David Wright fbdb655221 Add node structures and fast operations for them. 2022-11-19 17:56:08 -05:00
Robert Muir 62f2b42502
Prevent TestStressIndexing from taking minutes for normal non-NIGHTLY runs (#11953)
This test intentionally does a ton of filesystem operations: currently
about 20% of the time you can get really unlucky and get virus checker
simulated, against a real filesystem, which makes things really slow.

Instead use a ByteBuffersDirectory for local runs so that it doesn't
take minutes. The test can still be pretty slow even with this
implementation, so tone down the runtime so that it takes ~ 1.5s
locally.
2022-11-19 18:06:52 -05:00
Dawid Weiss 3f6410b738
Implement source code regeneration for test-framework perl scripts (#11952) 2022-11-19 23:40:45 +01:00
Karl David Wright 718ae33e21 Merge branch 'main' of https://gitbox.apache.org/repos/asf/lucene into main 2022-11-18 02:04:54 -05:00
Dawid Weiss 2f21a866c1
Add star import check/validation (#11949)
* Remove some old cruft that only slows down checks. Add star import check
* Expand wildcard imports to comply with the rule.
2022-11-18 16:42:59 +01:00
Karl David Wright 492a25a2a2 Fix formatting 2022-11-18 01:41:53 -05:00
Karl David Wright f3b4b057fa Refactor, in preparation for using b-trees to enhance performance. 2022-11-18 00:32:52 -05:00
Jack Conradson a18b62ded4
Decrease test time for TestManyKnnDocs.testLargeSegment (#11945)
* Improve speed of TestManyKnnDocs
2022-11-16 23:52:32 -05:00
Karl David Wright b6ebfd1861 Prevent NPEs while still handling the polar case for horizontal planes right off the pole 2022-11-16 11:03:24 -05:00
Mao Suhan 3c5bcb383b
fix bug of incorrect cost after upgradeToBitSet in DocIdSetBuilder class (#11939) 2022-11-16 17:04:15 +01:00
Robert Muir c5da727493
fix overflows in compound assignments (#11938)
* Count points as longs.

* Simplify KnnVectorsWriter.

Co-authored-by: Adrien Grand <jpountz@gmail.com>
2022-11-16 10:53:34 -05:00
Michael McCandless 02528c6757
#10878: add some test verbosity on failure (#11935) 2022-11-15 14:20:04 -05:00
Adrien Grand e2e14df5ac Add changes entry for #11930. 2022-11-15 15:31:47 +01:00
Adrien Grand 729dc2bb82
Introduce IOContext.LOAD (#11930)
The default codec has a number of small and hot files, that actually used to be
fully loaded in memory before we moved them off-heap. In the general case,
these files are expected to fully fit into the page cache for things to work
well. Should we give control over preloading to codecs? This is what this
commit does for the following files:
 - Terms index (`tip`)
 - Points index (`kdi`)
 - Stored fields index (`fdx`)
 - Terms vector index (`tvx`)

This only has an effect on `MMapDirectory`.
2022-11-15 13:59:51 +01:00
Robert Muir 1b9d98d6ec
enable error-prone "narrow calculation" check (#11923)
This check finds bugs such as https://github.com/apache/lucene/pull/11905.

See https://errorprone.info/bugpattern/NarrowCalculation
2022-11-15 06:42:41 -05:00
Lu Xugang e5426dbbd2
count() in BooleanQuery could be early quit (#11895)
* Count() in BooleanQuery could be early quit if queries are pure disjunctional
2022-11-15 17:57:51 +08:00
Adrien Grand 69a7cb22e7
More granular control of preloading on MMapDirectory. (#11929)
This enables configuring preloading on MMapDirectory based on the file name as well as the IOContext that is used to open the file.
2022-11-15 10:52:49 +01:00
Uwe Schindler 0741a354c0
Improve test introduced in #11918 to also check that the reported invalid position is transformed back to the original position by the slicing code (#11926) 2022-11-13 15:34:10 +01:00
Uwe Schindler 98b26e0885 fix merge problem in CHANGES.txt 2022-11-11 17:36:08 +01:00
Uwe Schindler 57ac311c70
Port generic exception handling from MemorySegmentIndexInput to ByteBufferIndexInput (#11918)
Port generic exception handling from MemorySegmentIndexInput to ByteBufferIndexInput. This also adds the invalid position while seeking or reading to the exception message.
2022-11-11 16:47:52 +01:00
Uwe Schindler 2a68f282f4 Synchronize changelog with 9.4 branch so we do not have duplicates 2022-11-11 16:36:02 +01:00
Peter Gromov 6fbc5f73c3
hunspell: introduce FragmentChecker to speed up ModifyingSuggester (#11909)
hunspell: introduce FragmentChecker to speed up ModifyingSuggester

add NGramFragmentChecker to quickly check whether insertions/replacements produce strings that are even possible in the language

Co-authored-by: Dawid Weiss <dawid.weiss@gmail.com>
2022-11-11 12:13:47 +01:00
Benjamin Trent c8d44acf20
Follow up to GITHUB#11916, remove deleted docs check (#11919) 2022-11-10 18:40:24 -05:00
Benjamin Trent 3a506ec87a
GITHUB#11911: improve checkindex to be more thorough for vectors (#11916)
search every N docs to get close to 64 tests
2022-11-10 16:45:47 -05:00
Benjamin Trent 1360baaee9
Fix integer overflow when seeking the vector index for connections (#11905)
* Fix integer overflow when seeking the vector index for connections
* Adding monster test to cause overflow failure
2022-11-10 08:24:32 -05:00
Peter Gromov f7417d5961
hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in) (#11893)
2022-11-09 08:20:50 +01:00
Greg Miller c66a559050
Further optimize DrillSideways scoring (#11881) 2022-11-08 10:08:12 -08:00
Benjamin Trent f9c26ed501
Fix latent casting bug in BKDWriter (#11907) 2022-11-08 15:55:07 +01:00
Peter Gromov 682e5c94e8
[hunspell] speed up WordFormGenerator (#11904) 2022-11-07 19:41:17 +01:00
Lu Xugang a8120bcb32
Simplify the logic of matchAll() in IndexSortSortedNumericDocValuesRangeQuery (#11884)
* Simplify the logic of matchAll() in IndexSortSortedNumericDocValuesRangeQuery
2022-11-07 19:09:52 +08:00
Michael Sokolov 48aad5090f
#11896: reduce top k in test to avoid split-graph (#11899) 2022-11-04 09:30:46 -04:00
Nhat Nguyen 1a5ad61b9d
Document that bulkScorer method can return null (#11897)
Like Weight#scorer, we should warn users that Weight#bulkScorer can 
return null if the query matches no documents.
2022-11-02 15:12:43 -07:00
Robert Muir 4e207fed62
Tone down TestDocumentsWriterStallControl.testRandom, so it does not take minutes (#11894)
This test often takes several minutes with normal runs (no NIGHTLY/multiplier/etc). Tone it down so that it isn't slow: CI builds can work it harder by passing those parameters
2022-11-02 12:17:15 -04:00
Peter Gromov 419ffd3974 [hunspell] perform a bit fewer checks after 2 suffixes have been removed 2022-10-31 10:09:54 +01:00
Marios Trivyzas 3210a42f09
Fix nanos to millis conversion for tests (#11856) 2022-10-29 09:05:17 +02:00
Navneet Verma e7253f112d
Add interface to relate a LatLonShape with another shape represented as Component2D. (#11753)
Adds createLatLonShapeDocValues and createXYShapeDocValues factory methods
to LatLonShape and XYShape factory classes, respectively.

Signed-off-by: Nicholas Walter Knize <nknize@apache.org>
2022-10-28 13:52:20 -05:00
Dawid Weiss 5c7edd7f38
Upgrade to gradle 7.5.1 (excluding launch scripts, which we have customized) (#11886) 2022-10-28 08:49:36 +02:00
Marc D'Mello 2793256682
GITHUB#11795: Add FilterDirectory to track write amplification factor (#11796)
* LUCENE-11795: Add FilterDirectory to track write amplification factor

* addressed feedback

* added optional temp output tracking and real time tracking

* addressed more feedback

* more improvements + added CHANGES.txt entry

* format edit to CHANGES.txt

* remove waf factor calculation

Co-authored-by: Marc D'Mello <dmellomd@amazon.com>
2022-10-27 15:07:56 -04:00
Michael Sokolov b3bc59910f
When evaluating expressions, defer calling advanceExact on operands until doubleValue() is called (#11878) 2022-10-26 14:05:39 -04:00
gf2121 05bd83dfe1
Use ByteArrayComparator for PointInSetQuery#MergePointVisitor (#11876) 2022-10-26 13:39:32 +08:00
gf2121 b1d1e488f2
Move LUCENE-10376 CHANGES entry to 10.0.0 (#11871) 2022-10-24 22:39:21 +08:00
iverase 976a38baa0 Add back-compat indices for 9.4.1 2022-10-24 15:20:44 +02:00
iverase 9ce6268cce Add bugfix version 9.4.1 2022-10-24 15:13:12 +02:00
gf2121 8cfbc18497
LUCENE-10376: Roll up the loop in vint/vlong in DataInput (#602) 2022-10-24 17:39:22 +08:00
Julie Tibshirani 0f525bfb14
Fix Lucene94HnswVectorsFormat validation on large segments (#11861)
When reading large segments, the vectors format can fail with a validation
error:

java.lang.IllegalStateException: Vector data length 3070061568 not matching
size=999369 * dim=768 * byteSize=4 = -1224905728

The problem is that we use an integer to represent the size, which is too small
to hold it. The bug snuck in during the work to enable int8 values, which
switched a long value to an int.
2022-10-19 13:49:59 -07:00
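The arithmetic behind the validation error quoted in #11861 can be reproduced with a minimal Java sketch. The class and method names below are hypothetical and are not taken from Lucene; only the three input numbers come from the commit message.

```java
// Hypothetical illustration of the overflow; not Lucene code.
class VectorDataLengthSketch {
  // Multiplies in 32-bit int arithmetic, so a large product wraps around.
  static int lengthAsInt(int size, int dim, int byteSize) {
    return size * dim * byteSize;
  }

  // Widens the first operand to long, so the whole product is computed in 64 bits.
  static long lengthAsLong(int size, int dim, int byteSize) {
    return (long) size * dim * byteSize;
  }

  public static void main(String[] args) {
    System.out.println(lengthAsInt(999369, 768, 4));  // -1224905728
    System.out.println(lengthAsLong(999369, 768, 4)); // 3070061568
  }
}
```

The int result matches the negative value in the exception message, and the long result matches the reported vector data length.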
Patrick Zhai 6cde41c9fd
GITHUB-11838 Change API to allow concurrent query rewrite (#11840)
Replace Query#rewrite(IndexReader) with Query#rewrite(IndexSearcher)
2022-10-19 09:49:40 -07:00
Peter Gromov 05971b3315
hunspell: speed up GeneratingSuggester by not deserializing non-suggestible roots (#11859) 2022-10-19 13:17:43 +02:00
Steven Schlansker f3d85be476
PrimaryNode: add configurable timeout to waitForAllRemotesToClose (#11822) 2022-10-18 17:21:01 -07:00
Adrien Grand 2ed16c7846 Revert "Binary search the entries when all suffixes have the same length in a leaf block. (#11722)"
This reverts commit 3adec5b1ce.
2022-10-18 14:27:02 +02:00
zhouhui 3adec5b1ce
Binary search the entries when all suffixes have the same length in a leaf block. (#11722) 2022-10-18 11:07:52 +02:00
Benjamin Trent cd5e200f47
Fix failure to load larger data sets in KnnGraphTest (#11849)
When running the `reindex` task with KnnGraphTester, exceptionally large
datasets can be used. Since mmap is used to read the data, we need to know the
buffer size. This size is limited to Integer.MAX_VALUE, which is inadequate for
larger datasets.

So, this commit adjusts the reading to only read a single vector at a time.
2022-10-17 16:39:58 -07:00
Peter Gromov 2958f2ae9d
hunspell: speedup suggestions by caching speller and compound stemming requests (#11857)
hunspell: speed up suggestions by caching speller and compound stemming requests
2022-10-17 21:25:12 +02:00
Zach Chen 21e3f654fb
LUCENE-10635: Ensure test coverage for WANDScorer by using a test query (#1039) 2022-10-15 13:02:02 -07:00
Robert Muir ece8ea715c
Fix ExitableDirectoryReader sampling constants to be power-of-2 (#11850)
If it's performance sensitive enough that we should do sampling, then we should avoid integer division too.
2022-10-15 12:05:15 -04:00
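As a rough sketch of why power-of-2 sampling constants matter in #11850: with a power-of-2 interval, the "every Nth call" test can be a bit mask rather than a remainder operation. The class name, the interval value, and the method below are assumptions made for illustration; the actual constants in ExitableDirectoryReader are not shown in this log.

```java
// Hypothetical illustration of mask-based sampling; not Lucene code.
class SamplingSketch {
  // Assumed value for illustration. Must be a power of two for the mask to be valid.
  static final int SAMPLE_INTERVAL = 1 << 4; // 16

  // Returns true once every SAMPLE_INTERVAL calls, using a mask instead of '%'.
  static boolean shouldCheck(int callCount) {
    return (callCount & (SAMPLE_INTERVAL - 1)) == 0;
  }

  public static void main(String[] args) {
    for (int i = 0; i < 40; i++) {
      if (shouldCheck(i)) {
        System.out.println("check at call " + i);
      }
    }
  }
}
```

For non-negative counters the mask test agrees with `callCount % SAMPLE_INTERVAL == 0`, but only when the interval is a power of two.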
Benjamin Trent a7369d7f59
Remove cancellation check on every vector (#11843)
We recently introduced support for kNN vectors to `ExitableDirectoryReader`.
Previously, we checked for cancellation not only on sampled calls `advance`,
but on every single call to `vectorValue`. This can cause significant overhead
when a query scans many vector values (for example the case where you're doing
an exact scan and computing a vector similarity for every matching document).

This PR removes the cancellation checks on `vectorValue`, since having them on
`advance` is already enough.
2022-10-13 09:29:33 -07:00
Marc D'Mello 3a608995a1
GITHUB-11761 (part 2): Fix unit tests to cleanly work with new TieredMergePolicy delete pct default (#11841)
Co-authored-by: Marc D'Mello <dmellomd@amazon.com>
2022-10-13 15:18:50 +02:00
Robert Muir 5e26b36ac8
Mark TestLongBitSet.testHugeCapacity @Monster as it requires a lot of memory (#11844)
Closes #11842
2022-10-13 07:20:21 -04:00
Peter Gromov ab50fe640b [hunspell] fix TestPerformance measurement after millis->nanos conversion 2022-10-12 11:29:07 +02:00
Marc D'Mello d966adcb62
GITHUB-11761: Move minimum TieredMergePolicy delete percentage and change default value (#11831)
Move minimum TieredMergePolicy delete percentage from 20% to 5%

and change deletePctAllowed default to 20%

Co-authored-by: Marc D'Mello <dmellomd@amazon.com>
2022-10-05 15:33:12 -07:00
Alan Woodward 6bd8733fdb
No need to rewrite queries in unified highlighter (#11807)
Since QueryVisitor added the ability to signal multi-term queries, the query rewrite
call in UnifiedHighlighter has been essentially useless, and with more aggressive
rewriting this is now causing bugs like #11490. We can safely remove this call.

Fixes #11490
2022-10-03 10:15:40 +01:00
Uwe Schindler aae293437f
Upgrade forbiddenapis to 3.4 (#11834) 2022-10-02 16:42:36 +02:00
Uwe Schindler 7333f0329b Fix typo in log message (we only support exactly Java 19) 2022-10-02 11:09:58 +02:00
Greg Miller 44b4602776
TermInSetQuery optimization when all docs in a field match a term (#11828) 2022-09-29 06:59:59 -07:00
Greg Miller 367cd2ea95 Associate correct PR with DrillSideway change in CHANGES 2022-09-29 05:48:29 -07:00
Greg Miller d02ba3134f
DrillSideways optimizations (#11803)
DrillSidewaysScorer now breaks up first- and second-phase matching and makes use of advance when possible over nextDoc.
2022-09-29 05:22:30 -07:00
Ignacio Vera 78b58b8e2e
Build SpatialVisitor once per index (#11825)
Address a performance regression on polygon queries using LatLonPoint field.
2022-09-27 10:51:49 +02:00
Greg Miller 971ae01164
Fix tie-break bug in various Facets implementations (#11768) 2022-09-26 15:05:57 -07:00
Greg Miller 734841d6c0
Optimize MultiTermQueryConstantScoreWrapper for case when a term matches all docs in a segment. (#11738) 2022-09-26 10:39:47 -07:00
Greg Miller ac12cd9f17
FacetsCollector#collect is no longer final to allow extension (#11804) 2022-09-26 10:15:31 -07:00
Uwe Schindler d943b76215 GITHUB-912: Remove deprecated APIs; fix link 2022-09-26 18:36:09 +02:00
Uwe Schindler 3b9c728ab5
MR-JAR rewrite of MMapDirectory with JDK-19 preview Panama APIs (>= JDK-19-ea+23) (#912)
This uses Gradle's auto-provisioning to compile Java 19 classes and build a multi-release JAR from them. Please make sure to regenerate gradle.properties (delete it) or change "org.gradle.java.installations.auto-download" to "true"
2022-09-26 15:22:04 +02:00
Adrien Grand 432296d967
Fix codec name in index header for Lucene94FieldInfosFormat. (#11818) 2022-09-26 14:56:30 +02:00
Dawid Weiss 6b82be5f11
Regenerate sources after dependency updates. (#11817) 2022-09-25 18:09:30 +02:00
Dawid Weiss 5d121ce44c
Upgrade several build dependencies. (#11812)
* Upgrade several build dependencies.

* Update error prone rules (those are off but they do trigger warnings/ errors)

* A few corrections I made before I turned off new warnings. Let's do another issue to fix them.
2022-09-25 17:10:22 +02:00
Robert Muir 15f3743f02
Remove Operations.isFinite (#11813)
This method is recursive: to avoid eating too much stack we apply a
small limit. This means it can't really be used on any largish automata
without hitting an exception.

But the benefit of knowing finite vs infinite in AutomatonTermsEnum is
minor: let's not auto-compute this. FuzzyQuery still gets the finite
optimization because it's finite by definition. PrefixQuery is always
infinite. Wildcard/Regex just assume infinite which is safe to do.

Remove the auto-computation and the "trillean" Boolean parameter. If you
don't know that your automaton is finite, pass false to
CompiledAutomaton, it is safe.

Move this method to AutomatonTestUtil so we can still use it in test
asserts.

Closes #11809
2022-09-24 10:51:04 -04:00
Dawid Weiss 54fba99cb1
Upgrade google java format and apply tidy (#11811) 2022-09-24 15:40:27 +02:00
Dawid Weiss 8bdfa90ea9 Fix and simplify the test (#11734). 2022-09-24 12:51:01 +02:00
Alan Woodward 188a78d769
Don't try to highlight very long terms (#11808)
The UnifiedHighlighter can throw exceptions when highlighting terms that are longer
than the maximum size the DaciukMihovAutomatonBuilder accepts. Rather than throwing
a confusing exception, we can instead filter out the long terms when building the
MemoryIndexOffsetStrategy. Very long terms are likely to be junk input in any case.
2022-09-24 11:26:16 +01:00
Luke Kot-Zaniewski 3a04aa44c2
Fix repeating token sentence boundary bug (#11734)
Signed-off-by: lkotzaniewsk <lkotzaniewsk@bloomberg.net>
Co-authored-by: Dawid Weiss <dawid.weiss@gmail.com>
2022-09-23 12:59:46 +02:00
jianping weng 5b24a233bd
LUCENE-10425: speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search (#687) 2022-09-22 08:51:13 +02:00
Shai Erera bcc116057d
Minor refactoring and cleanup to taxonomy index code (#11775) 2022-09-21 13:08:33 +03:00
Julie Tibshirani add309bb40 Mute TestKnnVectorQuery#testFilterWithSameScore while we work on a fix 2022-09-20 15:48:56 -07:00
Luca Cavanna 4eaebee686
Guard FieldExistsQuery against null pointers (#11794)
FieldExistsQuery checks if there are points for a certain field, and then retrieves the
corresponding point values. When all documents that had points for a certain field have
been deleted from a certain segment, as well as merged away, field info may report
that there are points yet the corresponding point values are null.

With this change we add a null check in FieldExistsQuery. Long term, we will likely want
to prevent this situation from happening.

Relates #11393
2022-09-20 15:38:38 +02:00
Adrien Grand 6c46662b43
Fix handling of ghost fields in string sorts. (#11792)
Introduction of dynamic pruning for string sorts (#11669) introduced a bug with
string sorts and ghost fields, triggering a `NullPointerException` because the
code assumes that `LeafReader#terms` is not null if the field is indexed
according to field infos.

This commit fixes the issue and adds tests for ghost fields across all sort
types.

Hopefully we can simplify and remove the null check in the future when we
improve handling of ghost fields (#11393).
2022-09-20 13:49:52 +02:00
Ignacio Vera ecb0ba542b
Improve tessellator performance by delaying calls to the method #isIntersectingPolygon (#11786) 2022-09-20 07:15:38 +02:00
Michael Sokolov 07af358f90
Diversity check bugfix (#11781)
* Fixes bug in HNSW diversity checks introduced in LUCENE-10577
2022-09-19 11:48:59 -04:00
Michael Sokolov e69c48b8d9 Fix rare bug in TestKnnVectorQuery when we have multiple segments 2022-09-18 20:21:39 +00:00
Namgyu Kim 451bab300e
GITHUB#11778: Add detailed part-of-speech tag for particle and ending on Nori (#11779) 2022-09-17 00:42:35 +09:00
Adrien Grand 155876a902 LUCENE-10674: Move changes entry to 9.4. 2022-09-16 16:59:42 +02:00
Dawid Weiss 9acc653995
GH-11172: remove WindowsDirectory and native subproject. (#11774) 2022-09-15 16:22:46 +02:00
John Mazanec 0587844742
LUCENE-10674: Ensure BitSetConjDISI returns NO_MORE_DOCS when sub-iterator exhausts. (#1068)
Signed-off-by: John Mazanec <jmazane@amazon.com>
2022-09-15 11:21:39 +02:00
Alexander Münch 5de685cfba
Removed duplicate check in SpanGradientFormatter (#11762) 2022-09-14 13:37:31 +01:00
Adrien Grand a426c6fec3 Fix integer overflow in tests. 2022-09-13 17:08:17 +02:00
Greg Miller 4463a0b271
GITHUB#11742: MatchingFacetSetsCounts#getTopChildren now returns top children instead of all children (#11764) 2022-09-13 06:50:52 -07:00
Dhiru Kholia 30b72ec364
Fix a typo affecting Luke (#11763) 2022-09-12 13:05:40 +02:00
Alan Woodward 41d03f69ce
Fix IntervalBuilder.NO_INTERVALS docId when unpositioned (#11760)
IntervalBuilder.NO_INTERVALS should return -1 when unpositioned,
not NO_MORE_DOCS. This can trigger exceptions when an empty
IntervalQuery is combined in a conjunction.

Fixes #11759
2022-09-09 17:19:15 +01:00
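The docID contract referred to in #11760 can be sketched without any Lucene dependency. The class below is hypothetical and only mimics the convention that an iterator reports -1 before it is first advanced and a sentinel once exhausted; the sentinel is assumed here to be Integer.MAX_VALUE, as in DocIdSetIterator.NO_MORE_DOCS. It is not the IntervalBuilder code itself.

```java
// Hypothetical illustration of an empty iterator's docID contract; not Lucene code.
class EmptyIteratorSketch {
  // Assumed sentinel, mirroring DocIdSetIterator.NO_MORE_DOCS.
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  // -1 means "unpositioned": nextDoc() has not been called yet.
  private int doc = -1;

  int docID() {
    return doc;
  }

  // An empty iterator has no documents, so the first advance exhausts it.
  int nextDoc() {
    doc = NO_MORE_DOCS;
    return doc;
  }

  public static void main(String[] args) {
    EmptyIteratorSketch it = new EmptyIteratorSketch();
    System.out.println(it.docID());   // -1
    System.out.println(it.nextDoc()); // 2147483647
  }
}
```

An iterator that reports the sentinel before being advanced looks already exhausted to a conjunction, which is the kind of inconsistency the commit describes.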
Mayya Sharipova 0ea8035612
LUCENE-10592 Better estimate memory for HNSW graph (#11743)
Better estimate memory used for OnHeapHnswGraph,
as well as add tests.

Also don't overallocate arrays in NeighborArray

Relates to #992
2022-09-08 16:54:29 -04:00
Yuting Gan 49b596ef02
Added a top-n range faceting example (#1035) 2022-09-08 12:19:42 -07:00
Julie Tibshirani 09a13aeaf2
LUCENE-10577: Remove LeafReader#searchNearestVectorsExhaustively (#11756)
This PR removes the recently added function on LeafReader to exhaustively search
through vectors, plus the helper function KnnVectorsReader#searchExhaustively.
Instead it performs the exact search within KnnVectorQuery, using a new helper
class called VectorScorer.
2022-09-08 12:15:02 -07:00
Robert Muir f4146a44e9
Fix TestIndexWriterOnDiskFull.testAddDocumentOnDiskFull to handle IllegalStateException from startCommit() (#11757)
If ConcurrentMergeScheduler is used, and the merge hits fatal exception (such as disk full) after prepareCommit()'s ensureOpen() check, then startCommit() will throw IllegalStateException instead of AlreadyClosedException.

The test is currently not prepared to handle this: the logic is only geared around exceptions coming from addDocument()

Closes #11755
2022-09-08 13:35:54 -04:00
Adrien Grand f8285fd0fe
Prevent term vectors from exceeding the maximum dictionary size. (#11726)
When indexing term vectors for a very large document, the automatic computation
of the dictionary size based on the overall size of the block might yield a
size that exceeds the maximum window size that is supported by LZ4. This commit
addresses the issue by automatically taking the minimum of the result of this
computation and the maximum window size (64kB).
2022-09-08 13:44:21 +02:00
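The fix described in #11726 amounts to clamping a computed size to an upper bound. A minimal sketch follows, assuming the 64kB window is exactly 64 * 1024 bytes; the class, constant, and method names are hypothetical and do not come from the Lucene source.

```java
// Hypothetical illustration of clamping the dictionary size; not Lucene code.
class DictionarySizeSketch {
  // Assumed byte value for the "64kB" maximum window size named in the commit message.
  static final int MAX_WINDOW_SIZE = 64 * 1024;

  // Takes the minimum of the automatically computed size and the maximum window size.
  static int dictionarySize(int computedSize) {
    return Math.min(computedSize, MAX_WINDOW_SIZE);
  }

  public static void main(String[] args) {
    System.out.println(dictionarySize(8 * 1024));   // small blocks are unchanged
    System.out.println(dictionarySize(512 * 1024)); // very large documents are capped
  }
}
```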
Marios Trivyzas dbffe3472b
LUCENE-10423: Remove usages of System.currentTimeMillis() from tests (#11749)
* Remove usages of System.currentTimeMillis() from tests

- Use Random from `RandomizedRunner` to be able to use a Seed to
  reproduce tests, instead of a seed coming from wall clock.
- Replace time based tests, using wall clock to determine periods
  with counter of repetitions, to have a consistent reproduction.

Closes: #11459

* address comments

* tune iterations

* tune iterations for nightly
2022-09-06 17:55:01 -04:00
Greg Miller 84cae4f27c
Simplify dense optimization check in TermInSetQuery (#11737) 2022-09-02 07:51:29 -07:00
Greg Miller 202dd809bd Ensure TermInSetQuery ScoreSupplier never returns null Scorer 2022-09-01 15:31:14 -07:00
Greg Miller 680f21dca5
LUCENE-10207: TermInSetQuery now provides a ScoreSupplier with cost estimation for use in IndexOrDocValuesQuery (#1058) 2022-09-01 14:04:43 -07:00
Michael Sokolov 0462a0ad73 fixed index order needed for TestKnnVectorQuery.testScoreEuclidean (#11732) 2022-09-01 09:53:57 -04:00
Michael Sokolov 1649964f07 Forward-port CHANGES entry for quantized HNSW vectors from 9.x branch 2022-09-01 09:53:46 -04:00
Mayya Sharipova 554fabf682
LUCENE-10633 Disable sort optimization for SortedSetSortField (#3125)
Add ability to SortedSetSortField to disable sort optimization
2022-08-30 16:52:28 -04:00
Michael Sokolov 61ef031f7f
SimpleText knn vectors; fix searchExhaustively and suppress a byte format test case (#11725) 2022-08-29 11:49:52 -04:00
Dawid Weiss 4f7543725c
#11720 Upgrade randomizedtesting to 2.8.1 (#11721) 2022-08-26 00:01:57 +02:00
Mike Drob dbc7a9764a
Add Integer awareness to RamUsageEstimator.sizeOf (#11715)
Additionally, update comments to reflect that we have not been VM cache-aware for a long time now.
2022-08-25 15:18:08 -05:00
Uwe Schindler 1d54299011
Fix classloading deadlock in analysis factories / AnalysisSPILoader initialization. This closes #11701 (#11718) 2022-08-25 18:16:04 +02:00
Greg Miller 1529606763
Optimize TermInSetQuery for terms that match all docs in a segment (#1062) 2022-08-23 08:37:44 -07:00
Michael Sokolov 8021c2db4e Don't throw an exception for byte-encoded vectors in SimpleText codec 2022-08-22 08:29:58 -04:00
Julie Tibshirani df67223497 Disable byte encoding in TestSimpleTextKnnVectorsFormat 2022-08-21 17:00:57 -07:00
Julie Tibshirani 653d2ebf71
Remove KnnVectorsFormat#currentVersion (#1077)
These internal versions only make sense within a codec definition, and aren't
meant to be exposed and compared across codecs. Since this method is only used
in tests, we can move the check to the test classes instead.
2022-08-21 13:09:07 -07:00
Michael Sokolov daa56d30f0 Fix TestHnswGraph rare failure 2022-08-20 17:26:50 -04:00
Michael Sokolov 0a58318e16
Fix for bad cast when sorting a KnnVectors index over BytesRef (#1074) 2022-08-20 17:23:47 -04:00
Michael Sokolov 798c02dd70
fix VectorUtil.dotProductScore normalization (#1073) 2022-08-20 09:15:38 -04:00
Michael Sokolov 60fa19d509
don't call BitSet.cardinality() more than needed (#1075) 2022-08-20 08:40:50 -04:00
Michael Sokolov f9680c6807
Add safety checks to KnnVectorField; fixed issue with copying BytesRef (#1076) 2022-08-20 08:38:42 -04:00
Julie Tibshirani 8308688d78
LUCENE-9583: Remove RandomAccessVectorValuesProducer (#1071)
This change folds the `RandomAccessVectorValuesProducer` interface into
`RandomAccessVectorValues`. This reduces the number of interfaces and clarifies
the cloning/ copying behavior.

This is a small simplification related to LUCENE-9583, but does not address the
main issue.
2022-08-19 18:04:05 -07:00
Yuting Gan 0914b537db
LUCENE-10644: Facets#getAllChildren testing should ignore child order (#1013) 2022-08-18 10:38:49 -07:00
Julie Tibshirani 7912ed02c4 Move Lucene91HnswGraphBuilder to test folder
It's only used in unit tests so it can live in the backwards_codecs tests.
2022-08-17 17:10:38 -07:00
Michael Sokolov bc214d4958 standardize exception text for vector dimension mismatch (in SimpleText codec) 2022-08-13 13:12:11 -04:00
Nick Knize 543910d900
LUCENE-10654: Fix ShapeDocValue Bounding Box failure (#1066)
The base spatial test case may create invalid self crossing polygons. These
polygons are cleaned by the tessellator which may result in an inconsistent
bounding box between the tessellated shape and the original, invalid, geometry.
This commit fixes the shape doc value test case to compute the bounding box from
the cleaned geometry instead of relying on the, potentially invalid, original
geometry.

Signed-off-by: Nicholas Walter Knize <nknize@apache.org>
2022-08-12 10:54:22 -05:00
Ignacio Vera fe8d11254a
LUCENE-10678: Fix potential overflow when computing the partition point on the BKD tree (#1065)
We currently compute the partition point for a set of points by multiplying the number of nodes that needs to be on
 the left of the BKD tree by the maxPointsInLeafNode. This multiplication is done on the integer space so if the partition point is bigger than Integer.MAX_VALUE it will overflow. This commit moves the multiplication to the long space so it doesn't overflow.
2022-08-11 15:25:53 +02:00
Michael Sokolov a693fe819b
LUCENE-10577: enable quantization of HNSW vectors to 8 bits (#1054)
* LUCENE-10577: enable supplying, storing, and comparing HNSW vectors with 8 bit precision
2022-08-10 17:09:07 -04:00
Vigya Sharma 59a0917e25
Fix typo in PostingsReaderBase docstring (#948)
* remove extra PostingsEnum from docstring
* add ImpactsEnum to docstring
2022-08-09 16:20:51 -07:00
Nick Knize d7fd48c950
LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape (#1017)
Adds new doc value field to support LatLonShape and XYShape doc values. The
implementation is inspired by ComponentTree. A binary tree of tessellated
components (point, line, or triangle) is created. This tree is then DFS
serialized to a variable compressed DataOutput buffer to keep the doc value
format as compact as possible.

DocValue queries are performed on the serialized tree using a similar component
relation logic as found in SpatialQuery for BKD indexed shapes. To make this
possible some of the relation logic is refactored to make it accessible to the
doc value query counterpart.

Note this does not support the following:

* Multi Geometries or Collections - This will be investigated by exploring 
  the addition of multi binary doc values.
* General Geometry Queries - This will be added in a follow on improvement. 

Signed-off-by: Nicholas Walter Knize <nknize@apache.org>
2022-08-09 12:51:45 -05:00
tang donghai b08e34722d
LUCENE-10646: Add some comment on LevenshteinAutomata (#1016)
* add Comment on Lev & pretty the toDot

* use auto generate scripts to add comment

* update checksum

* update checksum

* restore toDot

* add removeDeadStates in levAutomata

Co-authored-by: tangdonghai <tangdonghai@meituan.com>
2022-08-07 10:01:30 -04:00
Ignacio Vera bd0718f071
LUCENE-10673: Improve check of equality for latitudes for spatial3d GeoBoundingBox (#1056) 2022-08-04 06:47:27 +02:00
luyuncheng 34154736c6
LUCENE-10627: Using ByteBuffersDataInput reduce memory copy on compressing data (#987) 2022-08-01 18:34:41 +02:00
Adrien Grand 04e4f317cb LUCENE-10629: Fix NullPointerException.
I hit a NPE while running tests. `Weight#scorer` may return `null`, but not
`Scorer#iterator`.
2022-08-01 14:13:22 +02:00
Shai Erera 7ac75135b9
[LUCENE-10629]: Add fast match query support to FacetSets (#1015) 2022-07-31 07:50:03 +03:00
Dawid Weiss f93e52e5bb
LUCENE-10669: The build should be more helpful when generated resources are touched (#1053) 2022-07-30 20:45:32 +02:00
Adrien Grand 7c9d3cd6ff LUCENE-10633: Fix handling of missing values in reverse sorts. 2022-07-29 21:36:35 +02:00
Kaival Parikh 1ad28a3136
LUCENE-10559: Add Prefilter Option to KnnGraphTester (#932)
Added a `prefilter` and `filterSelectivity` argument to KnnGraphTester to be
able to compare pre and post-filtering benchmarks.

`filterSelectivity` expresses the selectivity of a filter as the proportion of
docs that pass it; the passing docs are selected randomly. We store these in a
FixedBitSet and use this to calculate true KNN as well as in HNSW search.

In case of post-filter, we over-select results as `topK / filterSelectivity` to
get final hits close to actual requested `topK`. For pre-filter, we wrap the
FixedBitSet in a query and pass it as prefilter argument to KnnVectorQuery.
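
A rough sketch of the two ideas above, under stated assumptions: it uses `java.util.BitSet` as a stand-in for Lucene's FixedBitSet, selects each doc independently with probability `filterSelectivity`, and rounds the over-selected count up. The class and method names are hypothetical, and KnnGraphTester may pick docs and round differently.

```java
import java.util.BitSet;
import java.util.Random;

final class FilterSelectivitySketch {
  // Marks each doc as passing with probability filterSelectivity (assumed selection scheme).
  static BitSet randomFilter(int numDocs, double filterSelectivity, long seed) {
    Random random = new Random(seed);
    BitSet passing = new BitSet(numDocs);
    for (int doc = 0; doc < numDocs; doc++) {
      if (random.nextDouble() < filterSelectivity) {
        passing.set(doc);
      }
    }
    return passing;
  }

  // Post-filter case: request topK / filterSelectivity hits so that roughly topK survive
  // the filter. Rounding up is an assumption of this sketch.
  static int overSelectedTopK(int topK, double filterSelectivity) {
    return (int) Math.ceil(topK / filterSelectivity);
  }
}
```

For example, with `topK = 10` and `filterSelectivity = 0.25`, the post-filter search would ask for 40 hits and expect about 10 of them to pass the filter.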
2022-07-29 11:21:34 -07:00
Adrien Grand eb7b7791ba
LUCENE-10633: Dynamic pruning for sorting on SORTED(_SET) fields. (#1023)
This commit enables dynamic pruning for queries sorted on SORTED(_SET) fields
by using postings to filter competitive documents.
2022-07-29 11:12:32 +02:00
iverase e1d2005df4 Add back-compat indices for 9.3.0 2022-07-29 10:13:20 +02:00
Greg Miller 4ebc249dbc
Add #scoreSupplier support to DocValuesRewriteMethod along with singleton doc value opto (#1020) 2022-07-28 11:12:21 -07:00
Shiming Li bb752c774c
LUCENE-10663: Fix KnnVectorQuery explain with multiple segments (#1050)
When there are multiple segments, KnnVectorQuery explain has a bug in locating
the doc ID. The doc ID passed to explain is segment-local, so it does not
include the segment's docBase, whereas the doc IDs held by
KnnVectorQuery.DocAndScoreQuery are offset by each segment's docBase. So
`DocAndScoreQuery.explain` needs to add the segment's docBase to the doc ID.
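
A simplified illustration of the lookup, assuming the matching doc IDs are kept in a sorted array in index-wide numbering; the class and method names are hypothetical and this is not the Lucene source.

```java
import java.util.Arrays;

final class ExplainDocBaseSketch {
  // Buggy lookup: searches the index-wide doc IDs using a segment-local doc ID.
  static int findWithoutDocBase(int[] globalDocs, int segmentDoc) {
    return Arrays.binarySearch(globalDocs, segmentDoc);
  }

  // Fixed lookup: converts the segment-local doc ID to an index-wide one first.
  static int findWithDocBase(int[] globalDocs, int docBase, int segmentDoc) {
    return Arrays.binarySearch(globalDocs, docBase + segmentDoc);
  }
}
```

With hits `{2, 7, 12}` and a second segment whose docBase is 10, explaining local doc 2 should resolve to the hit 12. The lookup without the docBase instead lands on the unrelated hit 2 from the first segment.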

Co-authored-by: Julie Tibshirani <julietibs@apache.org>
2022-07-28 10:31:49 -07:00
Adrien Grand 0ff987562a LUCENE-10661: Move CHANGES entry to 9.4. 2022-07-27 16:20:20 +02:00
luyuncheng 107747f359
LUCENE-10661: Reduce memory copy in BytesStore (#1047) 2022-07-27 16:17:08 +02:00
Weiming Wu 2cf12b8cdc
Cache decoded bytes for TFIDFSimilarity scorer. (#1042)
Co-authored-by: Weiming Wu <wweiming@amazon.com>
2022-07-26 13:47:52 +02:00
tang donghai 94960a0aff
precompute maxlevel in LogMergePolicy (#1045) 2022-07-26 13:42:32 +02:00
Mayya Sharipova 2efc204a39 LUCENE-10592 Strengthen TestHnswGraph::testSortedAndUnsortedIndicesReturnSameResults
This test occasionally fails if knn search returns only 1 document
in the index, as we have an assertion that returned doc IDs from
sorted and unsorted index must be different.

This patch ensures that we have many documents in the index, so
that knn search always returns enough results.
2022-07-25 09:48:43 -04:00
Greg Miller f943a57ebe Fix another TestDisiPriorityQueue bug 2022-07-22 14:32:08 -07:00
Mayya Sharipova bd06cebfc2 Add change log for LUCENE-10592 2022-07-22 12:14:58 -04:00
Mayya Sharipova fdbb76a8d7 Add next minor version 9.3.0 2022-07-22 12:01:08 -04:00
Mayya Sharipova ba4bc04271
LUCENE-10592 Build HNSW Graph on indexing (#992)
Currently, when indexing knn vectors, we buffer them in memory and
on flush during a segment construction we build an HNSW graph.
As building an HNSW graph is very expensive, this makes flush
operation take a lot of time. This also makes overall indexing
performance quite unpredictable – some indexing operations return
almost instantly while others that trigger flush take a lot of time.
This happens because flushes are unpredictable and triggered
by memory used, presence of concurrent searches etc.

Building an HNSW graph as we index documents avoids these problems,
as the load of HNSW graph construction is spread evenly during indexing.

Co-authored-by: Adrien Grand <jpountz@gmail.com>
2022-07-22 11:29:28 -04:00
Mayya Sharipova bd360f9b3e
Create Lucene94 Codec and move Lucene92 to backwards_codecs (#1041) 2022-07-22 10:04:10 -04:00
Michael Sokolov 6bdeb141b7 Revert "Create Lucene93 Codec and move Lucene92 to backwards_codecs (#924)"
This reverts commit f4f4a159b7.
2022-07-21 12:52:42 -04:00
Vigya Sharma 25a842d871
LUCENE-10583: Add docstring warning to not lock on Lucene objects (#963)
* add locking warning to docstring

* git tidy
2022-07-21 06:35:17 -04:00
Greg Miller 1bc38b7d1f Fix TestDisiPriorityQueue test bug 2022-07-20 11:33:14 -07:00
Lu Xugang 39e7597f6e
LUCENE-10656: It is unnecessary that using `limit` to check boundary (#1027) 2022-07-20 10:00:06 +08:00
Zach Chen 28ce8abb51
LUCENE-10480: Use BulkScorer to limit BMMScorer to only top-level disjunctions (#1018) 2022-07-19 18:59:19 -07:00
Greg Miller 3d7d85f245
LUCENE-10653: Heapify in BMMScorer (#1022) 2022-07-19 13:49:31 -07:00
Greg Miller a35dee5b27
Small tweak to IntervalQuery#visit logic (#1007) 2022-07-19 12:27:41 -07:00
Adrien Grand 11e7fe6618 LUCENE-10657: Move CHANGES entry to 9.3. 2022-07-19 09:39:18 +02:00
luyuncheng e5bf76b843
LUCENE-10657: CopyBytes now saves one memory copy on ByteBuffersDataOutput (#1034)
The abstract copyBytes method needs to copy from the input to a buffer and then write into ByteBuffersDataOutput. I think this is unnecessary: we can override it and copy directly from the input into the output.
2022-07-19 09:37:07 +02:00
hcqs33 9f80fea502
Fix error in TieredMergePolicy (#1028)
Fix an error when comparing the candidate's bytes with the max merge bytes.
It's wrong to compare candidateSize, rather than currentCandidateBytes, with maxMergeBytes.
2022-07-19 09:21:09 +02:00
Adrien Grand 216e38a159
Synchronize FieldInfos#verifyFieldInfos. (#1019)
This method is called from `addIndexes` and should be synchronized so that it
would see consistent data structures in case of concurrent indexing that would
be introducing new fields.

I hit a rare test failure of `TestIndexRearranger` that I can only explain by this lack of locking:

```
15:40:14    >     java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cannot read field "numDimensions" because "props" is null
15:40:14    >         at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
15:40:14    >         at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
15:40:14    >         at org.apache.lucene.misc.index.IndexRearranger.execute(IndexRearranger.java:98)
15:40:14    >         at org.apache.lucene.misc.index.TestIndexRearranger.testRearrangeUsingBinaryDocValueSelector(TestIndexRearranger.java:97)
15:40:14    >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
15:40:14    >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
15:40:14    >         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
15:40:14    >         at java.base/java.lang.reflect.Method.invoke(Method.java:568)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
15:40:14    >         at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
15:40:14    >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
15:40:14    >         at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
15:40:14    >         at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
15:40:14    >         at java.base/java.lang.Thread.run(Thread.java:833)
15:40:14    >
15:40:14    >         Caused by:
15:40:14    >         java.lang.NullPointerException: Cannot read field "numDimensions" because "props" is null
15:40:14    >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.FieldInfos$FieldNumbers.verifySameSchema(FieldInfos.java:459)
15:40:14    >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.FieldInfos$FieldNumbers.verifyFieldInfo(FieldInfos.java:359)
15:40:14    >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:3149)
15:40:14    >             at org.apache.lucene.misc.index.IndexRearranger.addOneSegment(IndexRearranger.java:139)
15:40:14    >             at org.apache.lucene.misc.index.IndexRearranger.lambda$execute$0(IndexRearranger.java:92)
15:40:14    >             at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
15:40:14    >             at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
15:40:14    >             at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
15:40:14    >             ... 1 more
```
2022-07-18 16:17:29 +02:00
Adrien Grand 0364402b30 Fix rare test failures of TestIndexSorting.
Sometimes the final merge might not require sorting depending on the merge
policy configuration.
2022-07-18 15:26:08 +02:00
Vigya Sharma 30a7c52e6c
LUCENE-10649: Fix failures in TestDemoParallelLeafReader (#1025) 2022-07-18 14:32:38 +02:00
Greg Miller 9b185b99c4
LUCENE-10603: Remove SSDV#NO_MORE_ORDS definition (#1021) 2022-07-13 20:02:17 -07:00
Vigya Sharma ca7917472b
LUCENE-10648: Fix failures in TestAssertingPointsFormat.testWithExceptions (#1012)
* Fix failures in TestAssertingPointsFormat.testWithExceptions

* remove redundant finally block

* tidy

* remove TODO as it is done now
2022-07-13 13:55:08 -04:00
Christine Poerschke 56462b5f96
LUCENE-10523: factor out UnifiedHighlighter.newFieldHighlighter() method (#821) 2022-07-13 18:43:31 +01:00
Greg Miller 7c35311f29
Specialize ordinal encoding for SortedSetDocValues (#1010) 2022-07-12 18:55:54 -07:00
tang donghai d7c2def019
LUCENE-10619: Optimize the writeBytes in TermsHashPerField (#966) 2022-07-12 17:14:37 +02:00
Greg Miller d6dbe4374a Move LUCENE-10614 CHANGES entry to 10.0 and add MIGRATE entry 2022-07-11 09:10:58 -07:00
Yuting Gan 5ef7e5025d
LUCENE-10614: Properly support getTopChildren in RangeFacetCounts (#974) 2022-07-11 09:04:46 -07:00
Vigya Sharma 128869d63a
LUCENE-10647: Fix TestMergeSchedulerExternal failures (#1011)
Ensure mergeScheduler.sync() gets called before we rollback the writer.
2022-07-11 11:23:17 +02:00
Vigya Sharma c06b98262c
add comment for no pauses in writeBytes (#1014) 2022-07-11 11:12:00 +02:00
Stefan Vodita dd4e8b82d7
LUCENE-10603: Stop using SortedSetDocValues.NO_MORE_ORDS in tests (#1004) 2022-07-07 09:54:41 -07:00
zacharymorn da8143bfa3
LUCENE-10480: Move scoring from advance to TwoPhaseIterator#matches to improve disjunction within conjunction (#1006) 2022-07-07 01:10:50 -07:00
Vigya Sharma 698f40ad51
LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges. (#633)
* Use merge policy and merge scheduler to run addIndexes merges

* wrapped reader does not see deletes - debug

* Partially fixed tests in TestAddIndexes

* Use writer object to invoke addIndexes merge

* Use merge object info

* Add javadocs for new methods

* TestAddIndexes passing

* verify field info schemas upfront from incoming readers

* rename flag to track pooled readers

* Keep addIndexes API transactional

* Maintain transactionality - register segments with iw after all merges complete

* fix checkstyle

* PR comments

* Fix pendingDocs - numDocs mismatch bug

* Tests with 1-1 merges and partial merge failures

* variable renaming and better comments

* add test for partial merge failures. change tests to use 1-1 findmerges

* abort pending merges gracefully

* test null and empty merge specs

* test interim files are deleted

* test with empty readers

* test cascading merges triggered

* remove nocommits

* gradle check errors

* remove unused line

* remove printf

* spotless apply

* update TestIndexWriterOnDiskFull to accept mergeException from failing addIndexes calls

* return singleton reader mergespec in NoMergePolicy

* rethrow exceptions seen in merge threads on failure

* spotless apply

* update test to new exception type thrown

* spotlessApply

* test for maxDoc limit in IndexWriter

* spotlessApply

* Use DocValuesIterator instead of DocValuesFieldExistsQuery for counting soft deletes

* spotless apply

* change exception message for closed IW

* remove non-essential comments

* update api doc string

* doc string update

* spotless

* Changes file entry

* simplify findMerges API, add 1-1 merges to MockRandomMergePolicy

* update merge policies to new api

* remove unused imports

* spotless apply

* move changes entry to end of list

* fix testAddIndicesWithSoftDeletes

* test with 1-1 merge policy always enabled

* please spotcheck

* tidy

* test - never use 1-1 merge policy

* use 1-1 merge policy randomly

* Remove concurrent addIndexes findMerges from MockRandomMergePolicy

* Bug Fix: RuntimeException in addIndexes

Aborted pending merges were slipping through the merge exception check in
API, and getting caught later in the RuntimeException check.

* tidy

* Rebase on main. Move changes to 10.0

* Synchronize IW.AddIndexesMergeSource on outer class IW object

* tidy
2022-07-06 18:15:47 -04:00
Peter Gromov d537013e70
LUCENE-10626: Hunspell: add tools to aid dictionary editing: analysis introspection, stem expansion and stem/flag suggestion (#975) 2022-07-05 21:38:03 +02:00
Adrien Grand 3dd9a5487c
LUCENE-10636: Avoid computing the same scores multiple times. (#1005)
`BlockMaxMaxscoreScorer` would previously compute the score twice for essential
scorers.

Co-authored-by: zacharymorn <zacharymorn@gmail.com>
2022-07-05 10:14:02 +02:00
Adrien Grand 81d4a7a69f
LUCENE-10151: Some fixes to query timeouts. (#996)
I noticed some minor bugs in the original PR #927 that this PR should fix:
 - When a timeout is set, we would no longer catch
   `CollectionTerminatedException`.
 - I added randomization to `LuceneTestCase` to randomly set a timeout, it
   would have caught the above bug.
 - Fixed visibility of `TimeLimitingBulkScorer`.
2022-07-04 17:32:38 +02:00
zacharymorn 503ec55973
LUCENE-10480: Use BMM scorer for 2 clauses disjunction (#972) 2022-07-02 13:26:16 -07:00
Julie Tibshirani 187f843e2a
LUCENE-10577: Add vectors format unit test and fix toString (#998)
We forgot to add this unit test when introducing the new 9.3 vectors format.
This commit adds the test and fixes issues it uncovered in toString.
2022-07-02 19:00:29 +02:00
Uwe Schindler fb617e29d0 Remove deprecations in main (#978) 2022-07-01 16:32:50 +02:00
Uwe Schindler 8a70988e63
Remove/deprecate obsolete constants in oal.util.Constants; remove code which is no longer executed after Java 9 (#978) 2022-07-01 16:24:18 +02:00
Greg Miller 5f2a4998a0
LUCENE-10603: Migrate remaining SSDV iteration to use docValueCount in production code (#995) 2022-06-30 14:01:14 -07:00
Greg Miller e05b3ec7de Add missing CHANGES entries for GH#983 and GH#984 2022-06-29 13:08:01 -07:00
Greg Miller bede1c3e8c
Switch Float/IntTaxonomyFacets to primitive list data structures in getAllChildren (#984) 2022-06-29 13:01:59 -07:00
Greg Miller 5c3d92d69d
Some refactoring/cleanup of AbstractSortedSetDocValueFacetCounts (#983) 2022-06-29 13:01:47 -07:00
Michael Sokolov e078bc1cd9 fix minor style issues in test 2022-06-29 11:09:46 -04:00
Michael Sokolov 95de554b65 CHANGES entry for LUCENE-10151 2022-06-29 10:35:09 -04:00
Deepika0510 af05550ebf
LUCENE-10151: Adding Timeout Support to IndexSearcher (#927)
Authored-by: Deepika Sharma <dpshrma@amazon.com>
2022-06-29 10:32:12 -04:00
Dawid Weiss 5328a70a29
Update randomizedtesting to 2.8.0, hppc to 0.9.1, morfologik to 2.1.9. (#991) 2022-06-28 15:42:07 +02:00
Alessandro Benedetti 8cf694fed2
LUCENE-10593: VectorSimilarityFunction reverse removal (#926)
* Vector Similarity Function reverse property removed

* NeighborQueue tie-breaking fixed (node id + node score encoding)

* NeighborQueue readability refactor

* BoundChecker removal (now it's only in backward-codecs)
2022-06-28 15:33:11 +02:00
Lu Xugang 3e74ebbc0d
Add entry (#990) 2022-06-28 12:03:18 +08:00
Lu Xugang d8fb47b674
LUCENE-10623: Error implementation of docValueCount for SortingSortedSetDocValues (#967) 2022-06-28 11:49:47 +08:00
Julie Tibshirani 7b58088bd5
Fix FieldExistsQuery rewrite when all docs have vectors (#986)
Before we were checking the number of vectors in the segment against the total
number of documents in IndexReader. This meant FieldExistsQuery would not
rewrite to MatchAllDocsQuery when there were multiple segments.
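
A simplified model of the check, assuming each segment is described only by its vector count and its document count; the names are hypothetical and real-world details such as deleted docs are ignored.

```java
final class FieldExistsRewriteSketch {
  // Old behaviour: each segment's vector count is compared with the reader-wide doc total,
  // so the check can only succeed when there is a single segment.
  static boolean canRewriteBuggy(int[] vectorCounts, int[] maxDocs) {
    int totalDocs = 0;
    for (int maxDoc : maxDocs) {
      totalDocs += maxDoc;
    }
    for (int count : vectorCounts) {
      if (count != totalDocs) {
        return false;
      }
    }
    return true;
  }

  // New behaviour: each segment's vector count is compared with that segment's own doc count.
  static boolean canRewriteFixed(int[] vectorCounts, int[] maxDocs) {
    for (int i = 0; i < vectorCounts.length; i++) {
      if (vectorCounts[i] != maxDocs[i]) {
        return false;
      }
    }
    return true;
  }
}
```

For two segments with 3 and 5 docs that all have vectors, the old check compares 3 and 5 against 8 and refuses to rewrite, while the per-segment check allows it.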
2022-06-27 09:59:53 +02:00
Kaival Parikh 03846b468e
LUCENE-10606: For KnnVectorQuery, optimize case where filter is backed by BitSetIterator (#951)
Instead of collecting hit-by-hit using a `LeafCollector`, we break down the
search by instantiating a weight, creating scorers, and checking the underlying
iterator. If it is backed by a `BitSet`, we directly update the reference (as
we won't be editing the `Bits`). Else we can create a new `BitSet` from the
iterator using `BitSet.of`.
2022-06-27 08:52:52 +02:00
Shai Erera 9338909373
Fix typos and minor refactoring to FacetConfig (#982) 2022-06-27 07:34:21 +03:00
Shai Erera ef486ea11a Use TestUtil.nextInt() instead of random().nextInt() 2022-06-25 18:13:38 +03:00
Marc D'mello f6bb9d218c
LUCENE-10274: Add FacetSets faceting capabilities (#841)
Co-authored-by: Marc D'Mello <dmellomd@amazon.com>
Co-authored-by: Shai Erera <serera@gmail.com>
Co-authored-by: Greg Miller <gsmiller@gmail.com>
2022-06-25 17:27:11 +03:00
Uwe Schindler eafc6420f3
Exclude Lucene's own JAR files from classpath entries in Eclipse config (#976) 2022-06-24 19:42:42 +02:00
Christine Poerschke 9199d48e7e
Remove outdated comment in UnifiedHighlighter.get(Formatter|Scorer) javadoc. (#820) 2022-06-24 18:00:28 +01:00
Luca Cavanna 2ccec3deb7
Rework TestElevationComparator (#980)
The test used to leave hanging threads behind following a failure. Also, one method was executing two different tests. I split the existing method into two and I am now leveraging setup and teardown to properly close all the resources both when the tests succeed and when they fail.
2022-06-24 18:11:51 +02:00