13930 Commits

Author SHA1 Message Date
Greg Miller f79b316bd5 Add CHANGES entry for GH#12334 2023-05-31 15:18:34 -07:00
Chaitanya Gohel d44be24025
Fix searchafter high latency when after value is out of range for segment (#12334) 2023-05-31 15:07:53 -07:00
Daniele Antuzi da36c24cb9
Use thread-safe search version of HnswGraphSearcher (#12246)
Addressing comment received in the PR https://github.com/apache/lucene/pull/12246
2023-05-30 15:38:06 +01:00
Luca Cavanna 72b91156f3
Don't generate stacktrace for TimeExceededException (#12335)
The exception is package private and never rethrown, we can avoid
generating a stacktrace for it.
2023-05-30 10:29:46 +02:00
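
One common way to get a stackless exception in Java is the protected four-argument `RuntimeException` constructor with `writableStackTrace` set to `false`. The sketch below shows that approach under a hypothetical class name. It is not taken from the commit, which may achieve the same effect differently (for example by overriding `fillInStackTrace()`).

```java
// Hypothetical exception type for illustration; not Lucene's TimeExceededException.
final class StacklessTimeoutException extends RuntimeException {
  StacklessTimeoutException(String message) {
    // cause = null, enableSuppression = false, writableStackTrace = false:
    // the JVM does not walk the stack when this exception is constructed.
    super(message, null, false, false);
  }
}
```

This only makes sense for exceptions used as internal control flow that are always caught, since the resulting exception carries no location information.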
Patrick Zhai d1850e44f3
Update TestVectorUtilProviders.java (#12338) 2023-05-29 16:26:29 -07:00
Uwe Schindler db0c21f25d Cleanup and update changes and synchronize with 9.x 2023-05-26 18:22:51 +02:00
Jonathan Ellis 431dc7b415
add BitSet.clear() (#12268) 2023-05-26 18:13:16 +02:00
Greg Miller 367b03bfc2
GH#12321: Reduce visibility of StringsToAutomaton (#12331) 2023-05-26 08:55:02 -07:00
Uwe Schindler f5f25777d8 Update changes to be correct with ARM (it is called NEON there) 2023-05-26 16:53:39 +02:00
Luca Cavanna 24712d7525 Move changes entry for #12328 to 9.7 2023-05-26 15:11:21 +02:00
Armin Braun fd75807350
Optimize ConjunctionDISI.createConjunction (#12328)
This method is showing up as a little hot when profiling some queries.
Almost all the time spent in this method is just burnt on ceremony
around stream indirections that don't inline.
Moving this to iterators, simplifying the check for same doc id and also saving one iteration (for the min
cost) makes this method far cheaper and easier to read.
2023-05-26 13:44:39 +02:00
Luca Cavanna 0ce6b9a67b Adjust changes entries for knn query concurrent rewrite
Moved entry for #12160 to 9.7.0 as it's been backported.
Added missing entry for #12325.
2023-05-26 09:25:54 +02:00
Luca Cavanna 10bebde269
Parallelize knn query rewrite across slices rather than segments (#12325)
The concurrent query rewrite for knn vector query introduced with #12160
requests one thread per segment to the executor. To align this with the
IndexSearcher parallel behaviour, we should rather parallelize across
slices. Also, we can reuse the same slice executor instance that the
index searcher already holds, in that way we are using a
QueueSizeBasedExecutor when a thread pool executor is provided.
2023-05-26 09:17:25 +02:00
Uwe Schindler c188d47a8b
Handle jdk.internal classes mentioned in vector superclass or interfaces during extraction (#12329) 2023-05-25 17:21:03 +02:00
Michael McCandless 7da7c43638
#12276: rename DaciukMihovAutomatonBuilder to StringsToAutomaton (#12310)
Closes #12276
2023-05-25 10:18:41 -04:00
Chris Hegarty f756f90644
Integrate the Incubating Panama Vector API (#12311)
Leverage accelerated vector hardware instructions in Vector Search.

Lucene already has a mechanism that enables the use of non-final JDK APIs, currently used for the Previewing Panama Foreign API. This change expands this mechanism to include the Incubating Panama Vector API. When the jdk.incubator.vector module is present at run time the Panamaized version of the low-level primitives used by Vector Search is enabled. If not present, the default scalar version of these low-level primitives is used (as it was previously).

Currently, we're only targeting support for JDK 20. A subsequent PR should evaluate JDK 21.
---------

Co-authored-by: Uwe Schindler <uschindler@apache.org>
Co-authored-by: Robert Muir <rmuir@apache.org>
2023-05-25 07:59:50 +01:00
Andrey Bozhko c9c49bc553
[MINOR] Update javadoc in Query class (#12233)
- add a few missing full stops
- update wording in the description of Query#equals method
2023-05-23 12:16:50 +02:00
Patrick Zhai 8a602b5063
Add multi-thread searchability to OnHeapHnswGraph (#12257) 2023-05-21 21:48:46 -07:00
Peter Gromov a454388b80
hunspell (minor): reduce allocations when processing compound rules (#12316) 2023-05-19 21:36:05 +02:00
Uwe Schindler 84e2e3afc3
Make sure APIJAR reproduces with different timezone (unfortunately java encodes the date using local timezone) (#12315) 2023-05-19 18:42:55 +02:00
Uwe Schindler a8a95e64ce Forward port references to AccessController in VirtualMethod (#12308) 2023-05-19 16:38:24 +02:00
Jerry Chin 04ef6de826
GITHUB-12291: Skip blank lines from stopwords list. (#12299) 2023-05-18 16:58:32 +02:00
Michael Sokolov 6b51cce0b8 NeighborQueue.reset() now clears incomplete flag 2023-05-18 10:23:22 -04:00
Greg Miller 3e4ca4042c
Minor cleanup and improvements to DaciukMihovAutomatonBuilder (#12305) 2023-05-18 07:01:19 -07:00
Michael Sokolov 2facb3ae0e Revert "allocate one NeighborQueue per search for results (#12255)"
This reverts commit 9a7efe92c0.
2023-05-18 13:42:17 +00:00
Petr Portnov | PROgrm_JARvis 0c6e8aec67
Seal `IndexReader` and `IndexReaderContext` (#12296) 2023-05-17 08:47:47 +02:00
tang donghai f53eb28af0
remove max recursion from Operations.java to AutomatonTestUtil.java (#12298)
Co-authored-by: tangdonghai <tangdonghai@meituan.com>
2023-05-16 07:09:28 -04:00
Patrick Zhai 8af305892d
Optimize HNSW diversity calculation (#12235) 2023-05-15 23:20:31 -07:00
tang donghai 0e172b0723
Update Javadoc for topoSortStates method after #12286 (#12292) 2023-05-15 18:06:01 +02:00
tang donghai 5d203f8337
toposort use iterator to avoid stackoverflow (#12286)
Co-authored-by: tangdonghai <tangdonghai@meituan.com>
2023-05-15 16:20:15 +02:00
Luca Cavanna 223e28ef16
Simplify SliceExecutor and QueueSizeBasedExecutor (#12285)
The only behaviour that QueueSizeBasedExecutor overrides from SliceExecutor is when to execute on the caller thread. There is no need to override the whole invokeAll method for that. Instead, this commit introduces a shouldExecuteOnCallerThread method that can be overridden.
2023-05-11 11:08:48 +02:00
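
The refactoring described in this commit is a template-method arrangement: the dispatch loop stays in the base class and subclasses override only the policy hook. The sketch below illustrates that shape with hypothetical, simplified classes. The class names, the default policy (only the last task runs on the caller thread) and the queue-size rule are assumptions for illustration, not Lucene's actual code.

```java
import java.util.List;
import java.util.concurrent.Executor;
import java.util.function.IntSupplier;

// Hypothetical, simplified stand-in for a slice executor; not Lucene's class.
class SketchSliceExecutor {
  private final Executor executor;

  SketchSliceExecutor(Executor executor) {
    this.executor = executor;
  }

  // The dispatch loop lives in one place and is never overridden.
  final void invokeAll(List<? extends Runnable> tasks) {
    for (int i = 0; i < tasks.size(); i++) {
      Runnable task = tasks.get(i);
      if (shouldExecuteOnCallerThread(i, tasks.size())) {
        task.run();
      } else {
        executor.execute(task);
      }
    }
  }

  // Assumed default policy: only the last task runs on the caller thread.
  boolean shouldExecuteOnCallerThread(int index, int numTasks) {
    return index == numTasks - 1;
  }
}

// Hypothetical subclass that changes only the policy hook.
class SketchQueueSizeBasedExecutor extends SketchSliceExecutor {
  private final IntSupplier queueSize;
  private final int maxQueueSize;

  SketchQueueSizeBasedExecutor(Executor executor, IntSupplier queueSize, int maxQueueSize) {
    super(executor);
    this.queueSize = queueSize;
    this.maxQueueSize = maxQueueSize;
  }

  @Override
  boolean shouldExecuteOnCallerThread(int index, int numTasks) {
    // Assumed rule: fall back to the caller thread when the queue is already full.
    return queueSize.getAsInt() >= maxQueueSize
        || super.shouldExecuteOnCallerThread(index, numTasks);
  }
}
```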
Marcus 963ed7ce88
`ToParentBlockJoinQuery` Explain Support Score Mode (#12245)
Co-authored-by: Mikhail Khludnev <mkhl@apache.org>
2023-05-10 19:10:37 +03:00
Luca Cavanna b6100d9787
Make TimeExceededException members final (#12271)
TimeExceededException has three members that are set within its constructor and never modified. They can be made final.
2023-05-09 11:28:23 +02:00
Luca Cavanna 082c49a9ef
Update javadocs for QueryTimeout (#12272)
QueryTimeout was introduced together with ExitableDirectoryReader but is
now also optionally set to the IndexSearcher to wrap the bulk scorer
with a TimeLimitingBulkScorer. Its javadocs need updating.
2023-05-09 11:27:47 +02:00
Luca Cavanna 10bad40ed3
Make query timeout members final in ExitableDirectoryReader (#12274)
There's a couple of places in the Exitable wrapper classes where
queryTimeout is set within the constructor and never modified. This
commit makes such members final.
2023-05-09 11:27:06 +02:00
Luca Cavanna 1cd9c1d66a add missing changelog entry for #12220 2023-05-09 10:57:28 +02:00
Luca Cavanna 67bb384f72 add missing changelog entry for #12260 2023-05-09 10:52:03 +02:00
Luca Cavanna 9579d2de76 Move changes entry for #12270 to 9.7.0 section 2023-05-09 10:28:22 +02:00
Armin Braun add9aba16d
Don't generate stacktrace in CollectionTerminatedException (#12270)
CollectionTerminatedException is always caught and never exposed to users so there's no point in filling
in a stack-trace for it.
2023-05-09 10:18:52 +02:00
Jonathan Ellis 9a7efe92c0
allocate one NeighborQueue per search for results (#12255) 2023-05-08 17:22:58 -04:00
Michael Sokolov a39885fdab
GITHUB-12224: remove KnnGraphTester (moved to luceneutil) (#12238) 2023-05-08 10:12:36 -04:00
Uwe Schindler 397c2e547a
Fix MMapDirectory documentation for Java 20 (#12265) 2023-05-05 12:04:38 +02:00
Luca Cavanna caeabf3930
Fix SynonymQuery equals implementation (#12260)
The term member of TermAndBoost used to be a Term instance and became a
BytesRef with #11941, which means its equals impl won't take the field
name into account. The SynonymQuery equals impl needs to be updated
accordingly to take the field into account as well, otherwise synonym
queries with same term and boost across different fields are equal which
is a bug.
2023-05-03 11:27:33 +02:00
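
To make the bug concrete: once the per-term objects hold only raw bytes, comparing them says nothing about the field, so the enclosing query has to compare the field itself. The sketch below uses a hypothetical, simplified query class (boosts omitted) rather than Lucene's `SynonymQuery`.

```java
import java.util.Arrays;

// Hypothetical, simplified query; the terms are raw bytes that carry no field name.
final class SketchSynonymQuery {
  private final String field;
  private final byte[][] terms;

  SketchSynonymQuery(String field, byte[]... terms) {
    this.field = field;
    this.terms = terms;
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) return true;
    if (!(obj instanceof SketchSynonymQuery)) return false;
    SketchSynonymQuery other = (SketchSynonymQuery) obj;
    // Comparing the terms alone would make queries on different fields equal.
    return field.equals(other.field) && Arrays.deepEquals(terms, other.terms);
  }

  @Override
  public int hashCode() {
    return 31 * field.hashCode() + Arrays.deepHashCode(terms);
  }
}
```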
Jonathan Ellis 3c163745bb Use HashMap (was TreeMap) for OnHeapHnswGraph neighbors 2023-04-30 17:59:39 -04:00
Patrick Zhai 1fa2be90ea Tidy the main branch 2023-04-26 21:21:57 -07:00
Alan Woodward 7374c200a1 Add next minor version 9.7.0 2023-04-26 16:44:47 +01:00
Christoph Büscher f45e096304
Add ordering of files in compound files (#12241)
Today there is no specific ordering of how files are written to a compound file.
The current order is determined by iterating over the set of file names in
SegmentInfo, which is undefined. This commit changes to an order based
on file size. Colocating data from files that are smaller (typically metadata
files like terms index, field info etc...) but accessed often can help when
parts of these files are held in cache.
2023-04-26 14:01:02 +01:00
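
A minimal sketch of a size-based ordering, assuming the file sizes are already known and that smaller files go first. The class name, the map-based input and the tie-break on file name are assumptions made to keep the example deterministic; they are not taken from the commit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class CompoundFileOrderSketch {
  // Returns the file names ordered by ascending size, ties broken by name.
  static List<String> orderBySize(Map<String, Long> fileSizes) {
    List<String> names = new ArrayList<>(fileSizes.keySet());
    names.sort((a, b) -> {
      int bySize = Long.compare(fileSizes.get(a), fileSizes.get(b));
      return bySize != 0 ? bySize : a.compareTo(b);
    });
    return names;
  }
}
```

With this ordering the small, frequently accessed files end up next to each other at the start of the compound file, which is the colocation effect the commit message describes.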
Luca Cavanna b0befef912
QueryProfilerWeight to extend FilterWeight (#12242)
QueryProfilerWeight should override matches and delegate to the
subQueryWeight. Another way to fix this issue is to make it extend
ProfileWeight and override only methods that need to have a different
behaviour than delegating to the sub weight.
2023-04-26 10:24:57 +02:00
Alessandro Benedetti 4deb0003c4 Word2VecSynonymFilter constructor null check (#12169) 2023-04-24 17:28:12 +02:00
Daniele Antuzi 1f4f2bf509
Introduced the Word2VecSynonymFilter (#12169)
Co-authored-by: Alessandro Benedetti <a.benedetti@sease.io>
2023-04-24 13:35:26 +02:00
Peter Gromov 5e0761eab5 remove timeout dependency from TestHunspell.testSuggestionOrderStabilityOnDictionaryEditing 2023-04-23 21:16:56 +02:00
Peter Gromov 025dfec2dd
Hunspell: reduce suggestion set dependency on the hash table order (#12239)
When adding words to a dictionary, suggestions for other words shouldn't change unless they're directly related to the added words.
But before, GeneratingSuggester selected 100 best first matches from the hash table, whose order can change significantly after adding any unrelated word.
That resulted in unexpected suggestion changes on seemingly unrelated dictionary edits.
2023-04-23 16:51:17 +02:00
Stefan Vodita 2e7426961b
Remove statement that SSDV facets aren't hierarchical (#12232) 2023-04-21 18:40:08 -04:00
Peter Gromov 60c9039d9f mention "GITHUB#12220: Hunspell: disallow hidden title-case entries from compound middle/end" CHANGES.txt 2023-04-21 15:17:33 +02:00
Usman Shaikh bed07c6b02
Update Javadoc comment to mention gradle instead of ant (#12201) 2023-04-18 22:14:19 -07:00
Kartik Ganesh 3813f5ab7c
Change the access modifier for the "expert" readLatestCommit API to public. (#12229)
This change also includes a unit test for this functionality.

Signed-off-by: Kartik Ganesh <gkart@amazon.com>
2023-04-18 14:38:35 -04:00
Andrey Bozhko 2d0dc6407a
Avoid redundant copies of BytesRef when constructing new Term (#12234) 2023-04-15 22:44:14 -07:00
Vigya Sharma 4e88118a35
Fix typo in CheckJoinIndex (#12231) 2023-04-14 14:06:19 -07:00
Marcus 2d7908e3c9
Explain term automaton queries (#12208) 2023-04-08 16:09:42 -07:00
Patrick Zhai c31017589b
Remove a test in TestDocumentsWriterDeleteQueue (#12223) 2023-04-04 10:49:14 -07:00
Peter Gromov 56aef7265a
hunspell: disallow hidden title-case entries from compound middle/end (#12220)
if we only have custom-case uART and capitalized UART, we shouldn't accept StandUart as a compound (although we keep hidden "Uart" dictionary entries for internal purposes)
2023-04-03 20:06:58 +02:00
Adrien Grand 56e65919b1
Adjust DWPT pool concurrency to the number of cores. (#12216)
After upgrading Elasticsearch to a recent Lucene snapshot, we observed a few
indexing slowdowns when indexing with low numbers of cores. This appears to be
due to the fact that we lost too much of the bias towards larger DWPTs in
apache/lucene#12199. This change tries to add back more ordering by adjusting
the concurrency of `DWPTPool` to the number of cores that are available on the
local node.
2023-03-31 15:07:48 +02:00
Greg Miller 172dfaf867 changes entry for GH#12212 2023-03-29 11:09:22 -07:00
Frederic Thevenet df1b0baa69
Fixes Searches made via DrillSideways may miss documents that should match the query (#12212) 2023-03-29 11:05:58 -07:00
Uwe Schindler b84b360f58
Upgrade forbiddenapis to version 3.5 (#12215)
Upgrade forbiddenapis to version 3.5.  This tones down some verbose warnings printed while checking Java 19 and Java 20 sourcesets for the MR-JAR
2023-03-27 13:30:22 +02:00
Hongyu Yan a6475cecbf
Fix ordered intervals query over interleaved terms (#12214)
Given an input text 'A B A C A B C' and search ORDERED(A, B, C), we should 
retrieve hits [0,3] and [4,6]; currently [4,6] is skipped.

After finding the first interval [0, 3], the subintervals will become A[0,0], B[1,1], 
C[3,3]; then the algorithm will try to minimize it and the subintervals will 
become: A:[2,2], B:[5,5], C:[3,3] (after finding 5 > 3 it breaks the minimization)

And when finding next interval, it will do advance(B) before checking whether 
it is after A (the do-while loop), so subintervals will become A[2,2], B[inf, inf], 
C[3,3] and return NO_MORE_INTERVAL.

This commit instead continues advancing subintervals from where the last
`nextInterval` call stopped, rather than always advancing all subintervals.
2023-03-27 09:18:33 +01:00
Adrien Grand 0782535017
Fully reuse postings enums when flushing sorted indexes. (#12206)
Currently we're only half reusing postings enums when flushing sorted indexes
as we still create new wrapper instances every time, which can be costly with
fields that have many terms.
2023-03-16 13:51:33 +01:00
Patrick Zhai d3b6ef3c86
Refactor part of IndexFileDeleter and ReplicaFileDeleter into a common utility class (#12126) 2023-03-15 20:51:49 -07:00
Adrien Grand f324204019
Reduce contention in DocumentsWriterPerThreadPool. (#12199)
Obtaining a DWPT and putting it back into the pool is subject to contention.
This change reduces contention by using 8 sub pools that are tried sequentially.
When applied on top of #12198, this reduces the time to index geonames with 20
threads from ~19s to ~16-17s.
2023-03-15 13:17:40 +01:00
Adrien Grand 805eb0b613
Use radix sort to sort postings when index sorting is enabled. (#12114)
This switches to LSBRadixSorter instead of TimSorter to sort postings whose
index options are `DOCS`. On a synthetic benchmark this yielded barely any
difference in the case when the index order is the same as the sort order, or
reverse, but almost a 3x speedup for writing postings in the case when the
index order is mostly random.
2023-03-15 11:56:45 +01:00
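
For readers unfamiliar with the technique, here is a minimal least-significant-byte radix sort over non-negative `int` doc IDs. It illustrates the general algorithm only and is not Lucene's `LSBRadixSorter`; the class name and the fixed four passes of 8 bits are assumptions of this sketch.

```java
final class LsbRadixSortSketch {
  // Sorts non-negative ints in place, one byte per pass, least significant byte first.
  static void sort(int[] docs) {
    int[] src = docs;
    int[] dst = new int[docs.length];
    for (int shift = 0; shift < 32; shift += 8) {
      int[] counts = new int[257];
      // Histogram of the current byte, shifted by one slot.
      for (int doc : src) {
        counts[((doc >>> shift) & 0xFF) + 1]++;
      }
      // Prefix sums turn the histogram into start offsets per bucket.
      for (int i = 0; i < 256; i++) {
        counts[i + 1] += counts[i];
      }
      // Stable scatter into the destination buffer.
      for (int doc : src) {
        dst[counts[(doc >>> shift) & 0xFF]++] = doc;
      }
      int[] tmp = src;
      src = dst;
      dst = tmp;
    }
    // Four passes is an even number of swaps, so the sorted data is back in docs.
  }
}
```

Unlike a comparison sort, the running time does not depend on how ordered the input already is, which fits the observation that the gain shows up when the index order is mostly random.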
Adrien Grand d407edf4b8
Reduce contention in DocumentsWriterFlushControl. (#12198)
lucene-util's `IndexGeoNames` benchmark is heavily contended when running with
many indexing threads, 20 in my case. The main offender is
`DocumentsWriterFlushControl#doAfterDocument`, which runs after every index
operation to update doc and RAM accounting.

This change reduces contention by only updating RAM accounting if the amount of
RAM consumption that has not been committed yet by a single DWPT is at least
0.1% of the total RAM buffer size. This effectively batches updates to RAM
accounting, similarly to what happens when using `IndexWriter#addDocuments` to
index multiple documents at once. Since updates to RAM accounting may be
batched, `FlushPolicy` can no longer distinguish between inserts, updates and
deletes, so all 3 methods got merged into a single one.

With this change, `IndexGeoNames` goes from ~22s to ~19s and the main offender
for contention is now `DocumentsWriterPerThreadPool#getAndLock`.

Co-authored-by: Simon Willnauer <simonw@apache.org>
2023-03-15 11:39:40 +01:00
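
The batching idea can be sketched as follows: each writer accumulates bytes locally and only touches the shared counter once its uncommitted amount reaches 0.1% of the RAM buffer. The class and method names are hypothetical and the shared state is reduced to a single `AtomicLong`; the real flush-control logic is considerably more involved.

```java
import java.util.concurrent.atomic.AtomicLong;

final class BatchedRamAccountingSketch {
  private final AtomicLong committedBytes = new AtomicLong(); // shared, contended state
  private final long threshold;

  BatchedRamAccountingSketch(long ramBufferBytes) {
    // 0.1% of the total RAM buffer size, as described in the commit message.
    this.threshold = Math.max(1, ramBufferBytes / 1000);
  }

  Writer newWriter() {
    return new Writer();
  }

  long committedBytes() {
    return committedBytes.get();
  }

  // One instance per indexing thread; its local counter needs no synchronization.
  final class Writer {
    private long uncommittedBytes;

    void afterDocument(long bytesUsed) {
      uncommittedBytes += bytesUsed;
      if (uncommittedBytes >= threshold) {
        committedBytes.addAndGet(uncommittedBytes);
        uncommittedBytes = 0;
      }
    }
  }
}
```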
Lu Xugang 62e175bf4f
Unrelated code in TestIndexSortSortedNumericDocValuesRangeQuery (#12153) 2023-03-15 15:22:32 +08:00
Marcus dfd9e0fe97
Remove the Now Unused Class `pointInPolygon`. (#12159)
Removes the unused Tessellator.pointInPolygon method.
2023-03-14 09:16:02 -05:00
Jasir KT 4851bd74f4
Fix boost missing in MultiFieldQueryParser (#12202)
When using boost along with any of fuzzy, wildcard, regexp, range or prefix queries, the boost is not applied.
2023-03-13 08:27:42 -04:00
Uwe Schindler e4d8a5c5cb
Implement MMapDirectory with Java 20 Project Panama Preview API (#12188) 2023-03-09 21:27:31 +01:00
Jasir KT 96efb34d00
Fix Slop Issue in MultiFieldQueryParser (#12196)
In Lucene 5.4.0, commit 62313b83ba9c69379e1f84dffc881a361713ce9 introduced some changes for immutability of queries. The setBoost() function was replaced with new BoostQuery(), but BoostQuery is not handled in the setSlop() function. This commit adds handling of BoostQuery in setSlop().
2023-03-08 11:16:09 -05:00
Greg Miller 0651d25713
Fixup TestLongValueFacetCounts after GITHUB#11744. (#12192)
GH#11744 deprecated LongValueFacetCounts#getTopChildrenSortByCount in favor
of the standard Facets#getTopChildren. The issue is that #getTopChildrenSortByCount
didn't do any input validation and allowed for topN == 0, while #getTopChildren
does input validation. Randomized testing could produce topN values of 0, which
resulted in failed tests. This addresses the tests.
2023-03-06 18:13:03 -08:00
Greg Miller afd3a7efbe
Remove LongValueFacetCounts#getTopChildrenSortByCount since it provides redundant functionality (#11744) 2023-03-06 12:12:23 -08:00
Tyler Bertrand c514089d66
Gradle optimizations (#12150)
* Define inputs and outputs for task validateJarLicenses
  * Lazily configure validateJarLicenses
* Move functionality from copyTestResources task into processTestResources task
  * Lazily configure processTestResources
  * Altered TestCustomAnalyzer.testStopWordsFromFile() to find resources in updated location
* Resolve "overlapping output" issue preventing processTestResources from being cached
* Provide system properties from CommandLineArgumentProviders
  * Configure certain system properties as inputs to take advantage of UP-TO-DATE checking
  * Applies the correct pathing strategies to take full advantage of caching even if builds are executed from different locations on disk
* Make validateSourcePatterns task cacheable by removing .gradle directory from its input
2023-03-06 19:17:37 +01:00
Greg Miller b4f969c197
Better PostingsEnum reuse in MultiTermQueryConstantScoreBlendedWrapper (#12179) 2023-03-06 09:09:52 -08:00
Christine Poerschke 3bd06b1cb9
GITHUB-12181: fix false-positive TestKnnFloatVectorQuery.testDocAndScoreQueryBasics() failure (#12182) 2023-03-06 15:29:36 +00:00
Kaival Parikh e0d92eef98
Concurrent rewrite for KnnVectorQuery (#12160)
- Reduce overhead of non-concurrent search by preserving original execution
- Improve readability by factoring into separate functions

---------

Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>
2023-03-04 01:12:11 -08:00
Greg Miller 569533bd76
Remove SortedSetDocValuesSetQuery in favor of TermInSetQuery with DocValuesRewriteMethod (#12175) 2023-03-01 08:25:44 -08:00
Greg Miller 00910cd6a4 Actually remove TermInSetQuery#getTermData (follow-up on #12173) 2023-03-01 05:44:24 -08:00
Greg Miller c8741f7c58
Deprecate TermInSetQuery#getTermData (#12173) 2023-03-01 05:42:39 -08:00
Greg Miller 3809106602
Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery (#12156) 2023-03-01 05:20:28 -08:00
Greg Miller 001acaf882
Clone the BytesRef[] values in KeywordField#newSetQuery (#12158) 2023-02-28 18:30:38 -08:00
Greg Miller b23b7475e1 Follow up on GH#12055 to remove un-referenced test methods 2023-02-27 16:03:44 -08:00
Adrien Grand c6667e709f
Better skipping for multi-term queries with a FILTER rewrite. (#12055)
This change introduces `MultiTermQuery#CONSTANT_SCORE_BLENDED_REWRITE`, a new rewrite method
meant to be used in place of `MultiTermQuery#CONSTANT_SCORE_REWRITE` as the default for multi-term
queries that act as a filter. Currently, multi-term queries with a filter rewrite internally rewrite to a
disjunction if 16 terms or less match the query. Otherwise postings lists of
matching terms are collected into a `DocIdSetBuilder`. This change replaces the
latter with a mixed approach where a disjunction is created between the 16
terms that have the highest document frequency and an iterator produced from
the `DocIdSetBuilder` that collects all other terms. On fields that have a
zipfian distribution, it's quite likely that no high-frequency terms make it to
the `DocIdSetBuilder`. This provides two main benefits:
 - Queries are less likely to allocate a FixedBitSet of size `maxDoc`.
 - Queries are better at skipping or early terminating.
On the other hand, queries that need to consume most or all matching documents may get a slowdown, so
users can still opt-in to the "full filter rewrite" functionality by overriding the rewrite method. This is the new
default for PrefixQuery, WildcardQuery and TermRangeQuery. 

Co-authored-by: Adrien Grand <jpountz@gmail.com> / Greg Miller <gsmiller@gmail.com>
2023-02-27 15:11:31 -08:00
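
The term-partitioning step of the blended approach can be sketched with a bounded min-heap: keep the 16 terms with the highest document frequency for the disjunction and route everything else to the bulk collector. The sketch below works on a plain map of hypothetical term statistics. It does not use the Lucene API, and the class and field names are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class BlendedRewriteSketch {
  static final int MAX_DISJUNCTION_TERMS = 16;

  // Terms that would become clauses of the disjunction.
  final List<String> highFrequencyTerms = new ArrayList<>();
  // Terms whose postings would be collected into a doc ID set builder.
  final List<String> otherTerms = new ArrayList<>();

  BlendedRewriteSketch(Map<String, Integer> docFreqs) {
    Comparator<Map.Entry<String, Integer>> byDocFreq =
        (a, b) -> Integer.compare(a.getValue(), b.getValue());
    // Min-heap: the head is always the least frequent of the retained terms.
    PriorityQueue<Map.Entry<String, Integer>> top = new PriorityQueue<>(byDocFreq);
    for (Map.Entry<String, Integer> entry : docFreqs.entrySet()) {
      top.offer(entry);
      if (top.size() > MAX_DISJUNCTION_TERMS) {
        otherTerms.add(top.poll().getKey());
      }
    }
    for (Map.Entry<String, Integer> entry : top) {
      highFrequencyTerms.add(entry.getKey());
    }
  }
}
```

On a skewed distribution the few retained terms account for most postings, so the remaining terms fed to the set builder are cheap to collect.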
Adrien Grand 38d253fb53
Lazily resolve ordinals when merging. (#12170)
The default implementation of merging doc values resolves the ordinal of a
document in `nextDoc()`. But sometimes, doc values iterators are consumed
without retrieving ordinals, e.g. to write the set of documents that have a
value, so this may be wasteful.

With this change, ordinals get resolved lazily upon `ordValue()`.
2023-02-27 17:38:59 +01:00
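
A minimal sketch of lazy resolution, assuming a hypothetical iterator over parallel arrays and an `IntUnaryOperator` standing in for the (potentially expensive) segment-to-merged ordinal mapping. None of these names come from Lucene.

```java
import java.util.function.IntUnaryOperator;

final class LazyOrdinalIteratorSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;
  private static final int UNRESOLVED = -1;

  private final int[] docs;
  private final int[] segmentOrds;
  private final IntUnaryOperator segmentToMergedOrd;
  private int index = -1;
  private int mergedOrd = UNRESOLVED;

  LazyOrdinalIteratorSketch(int[] docs, int[] segmentOrds, IntUnaryOperator segmentToMergedOrd) {
    this.docs = docs;
    this.segmentOrds = segmentOrds;
    this.segmentToMergedOrd = segmentToMergedOrd;
  }

  // Advancing only moves the cursor; no ordinal lookup happens here.
  int nextDoc() {
    index++;
    mergedOrd = UNRESOLVED;
    return index < docs.length ? docs[index] : NO_MORE_DOCS;
  }

  // The mapping is applied at most once per document, on first request.
  int ordValue() {
    if (mergedOrd == UNRESOLVED) {
      mergedOrd = segmentToMergedOrd.applyAsInt(segmentOrds[index]);
    }
    return mergedOrd;
  }
}
```

A consumer that only needs the set of documents with a value calls `nextDoc()` repeatedly and never pays for the mapping.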
Adrien Grand 2d157bd348
Remove LogMergePolicy's boundary at the floor level. (#12113)
`LogMergePolicy` has this boundary at the floor level that prevents merging
segments above the minimum segment size with segments below this size. I cannot
see a benefit from doing this, and no tests fail if I remove it, while this
boundary has the downside of not running merges that seem legit to me. Should
we remove this boundary check?
2023-02-27 17:38:34 +01:00
Adrien Grand cce33b07e4
Skip the TokenStream overhead when indexing simple keywords. (#12139)
Indexing simple keywords through a `TokenStream` abstraction introduces a bit
of overhead due to attribute management. Not much, but indexing keywords boils
down to adding to a hash map and appending to a postings list, which is quite
cheap too so even some low overhead can significantly impact indexing speed.
2023-02-21 14:00:11 +01:00
Benjamin Trent dbfca9a62b
Minor vector search matching doc optimizations (#12152)
The two minor performance improvements are around count and Weight#scorer.
segmentStarts is a monotonically increasing start for each scored document indexed by leaf-segment ordinal. Consequently, if the upper and lower segments are equivalent, that means no docs match for this segment.

Count is similarly calculated by the difference between upper and lower segmentStarts according to the segment ordinal.
2023-02-21 07:51:03 -05:00
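
The arithmetic behind both optimizations can be shown in a few lines. The layout assumed here is that `segmentStarts[i]` is the index of the first hit belonging to leaf `i` and that the array has one trailing entry holding the total number of hits; the class and method names are hypothetical.

```java
final class SegmentStartsSketch {
  // Number of hits that fall into the given leaf.
  static int count(int[] segmentStarts, int leafOrd) {
    return segmentStarts[leafOrd + 1] - segmentStarts[leafOrd];
  }

  // Equal lower and upper bounds mean the leaf has no hits, so no scorer is needed.
  static boolean hasMatches(int[] segmentStarts, int leafOrd) {
    return segmentStarts[leafOrd] != segmentStarts[leafOrd + 1];
  }
}
```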
Greg Miller 7506f8462f
Speed up DocValuesRewriteMethod by making use of sortedness (#12155) 2023-02-19 08:33:26 -08:00
Robert Muir 3ad2ede395
Implement ScorerSupplier for Sorted(Set)DocValuesField#newSlowRangeQuery (#12132)
Similar to use of ScorerSupplier in #12129, implement it here too,
because creation of a Scorer requires lookupTerm() operations in the DV
terms dictionary. This results in wasted effort/random accesses, if, based on the cost(),
IndexOrDocValuesQuery decides not to use this query.
2023-02-17 08:25:17 -05:00
Julie Tibshirani 8340b01c3c
Simplify max score for kNN vector queries (#12146)
The helper class DocAndScoreQuery implements advanceShallow to help skip
non-competitive documents. This method doesn't actually keep track of where it
has advanced, which means it can do extra work.

Overall the complexity here didn't seem worth it, given the low cost of
collecting matching kNN docs. This PR switches to a simpler approach, which uses
a fixed upper bound on the max score.
2023-02-16 12:03:59 -08:00
Nhat Nguyen 8e15c665be
Ensure caching all leaves from the upper tier (#12147)
This change adjusts the cache policy to ensure that all segments in the
max tier are cached. Before, we cached segments that have more than 3%
of the total documents in the index; now we cache segments that have more
than half of the average number of documents per leaf of the index.

Closes #12140
2023-02-16 10:41:28 -08:00
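
The difference between the two rules is easiest to see on an index with many equally sized segments. The sketch below encodes both rules as described in the commit message; the class name, the method names and the strictness of the comparisons are assumptions.

```java
final class LeafCachePolicySketch {
  // Earlier rule as described: the leaf must hold more than 3% of all documents.
  static boolean shouldCacheOld(int leafMaxDoc, int totalMaxDoc) {
    return leafMaxDoc > 0.03 * totalMaxDoc;
  }

  // New rule as described: the leaf must hold more than half of the average leaf size.
  static boolean shouldCacheNew(int leafMaxDoc, int totalMaxDoc, int numLeaves) {
    double averageDocsPerLeaf = (double) totalMaxDoc / numLeaves;
    return leafMaxDoc > averageDocsPerLeaf / 2;
  }
}
```

With 40 leaves of 1,000 documents each, every leaf holds 2.5% of the index, so the earlier rule caches none of them, while the new rule (threshold 500 documents) caches all of them.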
Julie Tibshirani 54044a82a0
Improve DocAndScoreQuery#toString (#12148)
Tiny improvements to DocAndScoreQuery:
* Make toString more informative
* Remove unnecessary 'k' parameter
2023-02-15 09:50:55 -08:00
Benjamin Trent 7baa01b3c2
Force merge into a single segment before getting the directory reader (#12138)
The test assumes a single segment is created (because only one scorer is created from the leaf contexts).

But, force merging wasn't done before getting the reader. Forcemerge to a single segment before getting the reader.
2023-02-09 11:27:08 -05:00
Alan Woodward f38d51ee89
Don't wrap readers when checking for term vector access in test (#12136)
In TestUnifiedHighlighterTermVec, we have a special reader which counts the
number of times term vectors are accessed, so that we can assert that caching
works correctly here. There is some special logic in place to skip the check
when the test framework wraps readers with CheckIndex or ParallelReader;
however, this logic no longer works with ParallelReader in particular, because
term vectors are now accessed through an anonymous class.

A simpler solution here is to call newSearcher(reader, false), which disables
wrapping, meaning that we can remove this extra logic entirely.

Fixes #12115
2023-02-09 09:29:41 +00:00
John Mazanec 776149f0f6
Reuse HNSW graph for initialization during merge (#12050)
* Remove implicit addition of vector 0

Removes logic to add 0 vector implicitly. This is in preparation for
adding nodes from other graphs to initialize a new graph. Having the
implicit addition of node 0 complicates this logic.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Enable out of order insertion of nodes in hnsw

Enables nodes to be added into OnHeapHnswGraph in out of order fashion.
To do so, additional operations have to be taken to resort the
nodesByLevel array. Optimizations have been made to avoid sorting
whenever possible.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Add ability to initialize from graph

Adds method to initialize an HNSWGraphBuilder from another HNSWGraph.
Initialization can only happen when the builder's graph is empty.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Utilize merge with graph init in HNSWWriter

Uses HNSWGraphBuilder initialization from graph functionality in
Lucene95HnswVectorsWriter. Selects the largest graph to initialize the
new graph produced by the HNSWGraphBuilder for merge.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Minor modifications to Lucene95HnswVectorsWriter

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Use TreeMap for graph structure for levels > 0

Refactors OnHeapHnswGraph to use TreeMap to represent graph structure of
levels greater than 0. Refactors NodesIterator to support set
representation of nodes.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Refactor initializer to be in static create method

Refactors initialization from graph to be accessible via a create
static method in HnswGraphBuilder.

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Address review comments

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Add change log entry

Signed-off-by: John Mazanec <jmazane@amazon.com>

* Remove empty iterator for neighborqueue

Signed-off-by: John Mazanec <jmazane@amazon.com>

---------

Signed-off-by: John Mazanec <jmazane@amazon.com>
2023-02-07 14:42:03 -05:00
Adrien Grand ab074d5483
Introduce a new `KeywordField`. (#12054)
`KeywordField` is a combination of `StringField` and `SortedSetDocValuesField`,
similarly to how `LongField` is a combination of `LongPoint` and
`SortedNumericDocValuesField`. This makes it easier for users to create fields
that can be used for filtering, sorting and faceting.
2023-02-07 18:19:09 +01:00
Adrien Grand d69326408c Fix javadoc references. 2023-02-07 11:06:04 +01:00
Adrien Grand 8b572f074e Fix nightly compatibility tests after #12116. 2023-02-07 10:46:30 +01:00
Uwe Schindler bc09c2a0d9
Add tests for size() and contains() to LongHashSet; fix size bug with MISSING (#12134) 2023-02-07 00:52:29 +01:00
Uwe Schindler 57403e26e0
Simplify LongHashSet by completely removing java.util.Set APIs (#12133) 2023-02-06 22:43:20 +01:00
Uwe Schindler 8564da434d
Generate gradle.properties from gradlew (#12131)
* SOLR-16641 - Generate gradle.properties from gradlew (#1320)
* Adapt for Lucene
* Remove localSettings from smoker; thanks @colvinco
* Print properties at end for debugging
* Add CHANGES.txt entry

---------

Co-authored-by: Colvin Cowie <colvin.cowie.dev@gmail.com>
Co-authored-by: Colvin Cowie <51863265+colvinco@users.noreply.github.com>
2023-02-06 19:47:15 +01:00
Robert Muir 02b202866c
Add CHANGES.txt for #12129 2023-02-06 12:49:49 -05:00
Robert Muir 0bc4135695
Speedup sandbox/DocValuesTermsQuery (#12129)
* Optimize the common case that docs only have single values for the field
* In the multivalued case, terminate reading docvalues if they are > maximum set ordinal
* Implement ScorerSupplier, so that (potentially large) number of ordinal lookups aren't performed just to get the cost()
* Graduate to Sorted(Set)DocValuesField.newSlowSetQuery to complement newSlowRangeQuery, newSlowExactQuery

Like other slow queries in these classes, it's currently only recommended to use with points, e.g. IndexOrDocValuesQuery(new PointInSetQuery, newSlowSetQuery)
2023-02-06 12:47:53 -05:00
Robert Muir 10d9c7440b
Speed up docvalues set query by making use of sortedness (#12128)
LongHashSet is used for the set of numbers, but it has some issues:
* tries too hard to extend AbstractSet, mostly for testing
* causes traps with boxing if you aren't careful
* complex hashcode/equals

Practically we should take advantage of the fact numbers come in sorted
order for multivalued fields: just like range queries do. So we use
min/max to our advantage, including termination of docvalues iteration

Actually it is generally a win to just check min/max even in the single-valued
case: these constant time comparisons are cheap and can avoid hashing,
etc.

In the worst-case, if all of your query Sets contain both the minimum and maximum
possible values, then it won't help, but it doesn't hurt either.
2023-02-06 12:14:02 -05:00
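
A minimal sketch of the min/max idea, assuming per-document values arrive in ascending order. For brevity it uses a boxed `HashSet<Long>`, which is exactly the kind of boxing the commit wants to avoid in production code; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class MinMaxLongSetSketch {
  private final Set<Long> values = new HashSet<>();
  private final long min;
  private final long max;

  MinMaxLongSetSketch(long... queryValues) {
    long lo = Long.MAX_VALUE;
    long hi = Long.MIN_VALUE;
    for (long v : queryValues) {
      values.add(v);
      lo = Math.min(lo, v);
      hi = Math.max(hi, v);
    }
    this.min = lo;
    this.max = hi;
  }

  // Two constant-time comparisons can rule a value out before any hashing.
  boolean contains(long value) {
    return value >= min && value <= max && values.contains(value);
  }

  // Multi-valued case: values are sorted, so iteration can stop once past max.
  boolean matches(long[] sortedDocValues) {
    for (long value : sortedDocValues) {
      if (value > max) {
        return false;
      }
      if (value >= min && values.contains(value)) {
        return true;
      }
    }
    return false;
  }
}
```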
Robert Muir a6bceb7cf0
Remove useless abstractions in DocValues-based queries (#12127)
There's no need to make things abstract: DocValues does the right thing
Optimizing for where no docs for the field in the segment exist is easy, simple null check (replacing the existing one!)
2023-02-06 11:49:25 -05:00
Adrien Grand 684f31ef06 Fix ambiguous links. (#12116) 2023-02-06 17:45:05 +01:00
Adrien Grand 96136b4282
Improve document API for stored fields. (#12116)
Currently stored fields have to look at binaryValue(), stringValue() and
numericValue() to guess the type of the value and then store it. This has a few
issues:
 - If there is a problem, e.g. all of these 3 methods return null, it's
   currently discovered late, when we already passed the responsibility of
   writing data from IndexingChain to the codec.
 - numericValue() is used both for numeric doc values and storage. This makes
   it impossible to implement a `double` field that is stored and doc-valued,
   as numericValue() needs to return simultaneously a number that consists of
   the double for storage, and the long bits of the double for doc values.
 - binaryValue() is used both for sorted(_set) doc values and storage. This
   makes it impossible to implement a `keyword` field that is stored and
   doc-valued, as the field returns a non-null value for both binaryValue() and
   stringValue() and stored fields no longer know which field to store.

This commit introduces `IndexableField#storedValue()`, which is used only for
stored fields. This addresses the above issues. IndexingChain passes the
storedValue() directly to the codec, so it's impossible for a stored fields
format to mistakenly use binaryValue()/stringValue()/numericValue() instead of
storedValue().
2023-02-06 17:08:13 +01:00
Benjamin Trent c2bef381d1
Fix TestFeatureField#testBasicsNonScoringCase test (#12130)
Sometimes the random search lucene test searcher will wrap the reader. Consequently, we need to make sure to use the reader provided by the test IndexSearcher or the reader may be different between creating the weight with the searcher vs. accessing the leaf context for the scorer.
2023-02-06 10:15:00 -05:00
Benjamin Trent 4bbc273a43
Add `FeatureQuery` weight caching in non-scoring case (#12118)
While FeatureQuery is a powerful tool in the scoring case, there are scenarios when caching should be allowed and scoring disabled.

A particular case is when the FeatureQuery is used in conjunction with learned-sparse retrieval. It is useful to iterate and calculate the entire matching doc set when combined with various other queries.

related to: https://github.com/apache/lucene/issues/11799
2023-02-02 13:00:33 -05:00
Ioana Tagirta d591c9c37a
Always generate a polygon that has no self intersections (#12124) 2023-02-02 16:57:42 +01:00
Jean-François BOEUF 5acca82633
Reduce bloom filter size by using the optimal count for hash functions. (#11900) 2023-02-01 14:35:50 +01:00
Marc D'Mello f9cb6a3b42
GITHUB-11868: Add FilterIndexInput and FilterIndexOutput wrapper classes (#11958)
Co-authored-by: Marc D'Mello <dmellomd@amazon.com>
2023-02-01 14:11:12 +01:00
Luca Cavanna 57397f0cab
Adjust return type for VectorUtil methods (#12122)
Two of the methods (squareDistance and dotProduct) that take byte arrays return a float while
the variable used to store the value is an int. They can just return an int.
2023-02-01 10:48:05 +01:00
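A minimal sketch of why an `int` return type is enough for the byte-array variants mentioned in the commit above. `ByteVectorMath` is a hypothetical helper written for illustration, not Lucene's `VectorUtil`; it only shows that the accumulator is an `int` throughout, so no `float` conversion is needed.

```java
/** Hypothetical helper, not Lucene's VectorUtil. */
final class ByteVectorMath {
  private ByteVectorMath() {}

  /** Dot product of two byte vectors, accumulated and returned as an int. */
  static int dotProduct(byte[] a, byte[] b) {
    checkSameLength(a, b);
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      total += a[i] * b[i]; // bytes are promoted to int before multiplying
    }
    return total;
  }

  /** Squared Euclidean distance of two byte vectors, returned as an int. */
  static int squareDistance(byte[] a, byte[] b) {
    checkSameLength(a, b);
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      int diff = a[i] - b[i];
      total += diff * diff;
    }
    return total;
  }

  private static void checkSameLength(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector dimensions differ: " + a.length + " != " + b.length);
    }
  }
}
```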
Luca Cavanna ce433e5449
Remove VectorUtil#toBytesRef (#12121)
The method is currently only used in its corresponding test method.
2023-02-01 10:47:50 +01:00
Luca Cavanna 73e2ae2705 Add back-compat indices for 9.5.0 2023-01-30 20:50:01 +01:00
Luca Cavanna 2bd87b7909 Fix formatting of Version.java 2023-01-25 13:49:23 +01:00
Luca Cavanna 59da15b0e5 Add minor version 9.6.0 2023-01-25 13:42:38 +01:00
Luca Cavanna 72c8b334a0 Add missing LUCENE_9_5_0 version 2023-01-25 13:37:48 +01:00
Benjamin Trent d1fa52e62f
Fix flaky TestHnswByteVectorGraph.testSortedAndUnsortedIndicesReturnSameResults test (#12110) 2023-01-25 10:47:20 +01:00
Luca Cavanna 5a51ce1d5d
SimpleText codec to support writing byte vectors (#12111)
A recent test failure signaled that when the simple text codec was randomly selected, byte vectors could not be written.
This commit addressed that by adding support for writing byte vectors to SimpleTextKnnVectorsWriter.

Note that while support is added to the BufferingKnnVectorsWriter base class, 90, 91 and 92 writers don't need to support
byte vectors and will throw unsupported operation exception when attempting to do that.
2023-01-25 10:43:35 +01:00
Luca Cavanna 95e2cfcc1e
Remove deprecated float vector classes and methods (#12107)
Follow-up of #12105 to remove the deprecated classes for the next major version.

Removes KnnVectorField, KnnVectorQuery, VectorValues and LeafReader#getVectorValues.
2023-01-24 16:25:36 +01:00
Adrien Grand ce8eaf138c
MemoryIndex should not fail integer fields that enable doc values. (#12109)
When a field indexes numeric doc values, `MemoryIndex` does an unchecked cast
to `java.lang.Long`. However, the new `IntField` represents the value as a
`java.lang.Integer` so this cast fails. This commit aligns `MemoryIndex` with
`IndexingChain` by casting to `Number` and calling `Number#longValue` instead
of casting to `Long`.
2023-01-24 11:51:49 +01:00
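The failure mode described in the commit above can be reproduced in plain Java. This is a sketch only: `NumericValueReader` and its methods are hypothetical names, and the code is not taken from `MemoryIndex`.

```java
/** Hypothetical illustration of the two casting strategies. */
final class NumericValueReader {
  private NumericValueReader() {}

  /** Works for any boxed numeric type (Integer, Long, ...) by going through Number. */
  static long toLong(Object value) {
    return ((Number) value).longValue();
  }

  /** The fragile variant: throws ClassCastException unless the value is a Long. */
  static long uncheckedToLong(Object value) {
    return (Long) value;
  }
}
```

Passing a `java.lang.Integer` to `uncheckedToLong` throws a `ClassCastException`, while `toLong` widens it to a `long` as expected.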
Luca Cavanna 25623a63bf format changelog entry and add missing author name 2023-01-24 10:57:52 +01:00
Luca Cavanna 92e67ec626
Rename vector float classes and methods (#12105)
We recently introduced KnnByteVectorField, KnnByteVectorQuery and ByteVectorValues. The corresponding float variants of the same classes don't follow the same naming convention: KnnVectorField, KnnVectorQuery and VectorValues. Ideally their names would reflect that they are the float variant of the vector field, vector query and vector values.

This commit aims at clarifying this in the public facing API, by deprecating the current float classes in favour of new ones that are their exact copy but follow the same naming conventions as the byte ones.

As a result, LeafReader#getVectorValues is also deprecated in favour of newly introduced getFloatVectorValues method that returns FloatVectorValues.

Relates to #11963
2023-01-24 10:20:24 +01:00
Luca Cavanna f5bd28662f Remove BytesRef usage from SortingCodecReader
Follow-up of #12102
2023-01-23 22:29:10 +01:00
Luca Cavanna 4594400216
Replace BytesRef usages in byte vectors API with byte[] (#12102)
The main classes involved are ByteVectorValues, KnnByteVectorField and KnnByteVectorQuery. It becomes quite natural to simplify things further and use byte[] in the following methods too: ByteVectorValues#vectorValue, KnnVectorsReader#search, LeafReader#searchNearestVectors, HnswGraphSearcher#search, VectorSimilarityFunction#compare, VectorUtil#cosine, VectorUtil#squareDistance, VectorUtil#dotProduct, VectorUtil#dotProductScore.
2023-01-23 22:06:00 +01:00
Luca Cavanna f8ee852696
add missing changelog for #12064 2023-01-23 21:22:15 +01:00
Adrien Grand de94fa97fb
Remove binaryValue() on VectorValues and ByteVectorValues. (#12101)
This method tries to expose an encoded view of vectors, but we shouldn't have
this part of our user-facing API. With this change, the way vectors are encoded
is entirely on the codec.
2023-01-23 14:42:49 +01:00
Alessandro Benedetti 77eca4bb38
Introduce getters for KnnVectorQuery (#12029) 2023-01-23 12:35:08 +01:00
Chris Hostetter 9007f746a3 WordBreakSpellChecker now correctly respects maxEvaluations (#12077) 2023-01-22 15:44:29 -07:00
Vigya Sharma 519adcc954
Fix failure in TestIndexSortSortedNumericDocValuesRangeQuery.testCountBoundary (#12098) 2023-01-19 16:16:42 -08:00
Lu Xugang a9fd21b6af
Same bound with fallbackQuery (#12084)
IndexSortSortedNumericDocValuesRangeQuery should have the same bounds as fallbackQuery.
2023-01-19 14:33:58 +08:00
Vigya Sharma dc33ade76d
Remove UTF8TaxonomyWriterCache (#12092)
Removes the never-evicting UTF8TaxonomyWriterCache, changing the default to LruTaxonomyWriterCache
2023-01-18 13:20:26 -08:00
twosom 318b002e0b
fix typo in KoreanNumberFilter (#12045)
* fix typo in KoreanNumberFilter

* fix doc format
2023-01-17 22:32:59 -08:00
Robert Muir 4fe8424925
Graduate DocValuesNumbersQuery from lucene/sandbox to newSlowSetQuery() (#12087)
Clean up this query a bit and support:
* NumericDocValuesField.newSlowSetQuery()
* SortedNumericDocValuesField.newSlowSetQuery()

This complements the existing docvalues-based range queries, with a set query.

Add ScorerSupplier/cost estimation support to PointInSetQuery
Add newSetQuery() to IntField/LongField/DoubleField/FloatField, that uses IndexOrDocValuesQuery
2023-01-16 09:38:08 -05:00
Alan Woodward fc7b937aff
Don't throw UOE when highlighting FieldExistsQuery (#12088)
WeightedSpanTermExtractor will try to rewrite queries that it doesn't
know about, to see if they end up as something it does know about and
that it can extract terms from. To support field merging, it rewrites against
a delegating leaf reader that does not support getFieldInfos().

FieldExistsQuery uses getFieldInfos() in its rewrite, which means that
if one is passed to WeightedSpanTermExtractor, we get an
UnsupportedOperationException thrown.

This commit makes WeightedSpanTermExtractor aware of FieldExistsQuery,
so that it can just ignore it and avoid throwing an exception.
2023-01-16 11:47:51 +00:00
Robert Muir e06f8c2e8b
Update to error-prone 2.17 (#12056) 2023-01-14 11:38:39 -05:00
Robert Muir 8ca05967e8
remove non-NRT replication support (#12038)
Remove non-NRT replication support in 10.x (to be deprecated in 9.5)
2023-01-14 11:14:46 -05:00
Adrien Grand b5062a2858
MultiCollector shouldn't report that scores are needed when they're not. (#12083)
When sub collectors don't agree on their `ScoreMode`, `MultiCollector`
currently returns `COMPLETE`. This makes sense when assuming that there is
likely one collector computing top hits (`TOP_SCORES`) and another one
computing facets (`COMPLETE_NO_SCORES`). However, it
is also possible to have one collector computing top hits by field (`TOP_DOCS`)
and another one doing facets (`COMPLETE_NO_SCORES`), and `MultiCollector`
shouldn't report that scores are needed in that case.
2023-01-13 14:44:17 +01:00
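A rough sketch of the combination rule described in the commit above, assuming a simplified `Mode` enum that stands in for Lucene's `ScoreMode` and keeps only a `needsScores` flag. Both `Mode` and `ModeCombiner` are hypothetical names for illustration and do not reproduce the actual `MultiCollector` code.

```java
/** Simplified stand-in for Lucene's ScoreMode, keeping only the needsScores flag. */
enum Mode {
  COMPLETE(true),
  COMPLETE_NO_SCORES(false),
  TOP_SCORES(true),
  TOP_DOCS(false);

  final boolean needsScores;

  Mode(boolean needsScores) {
    this.needsScores = needsScores;
  }
}

final class ModeCombiner {
  private ModeCombiner() {}

  /** If sub collectors disagree, fall back to a COMPLETE mode that only asks for scores when needed. */
  static Mode combine(Mode... subModes) {
    if (subModes.length == 0) {
      throw new IllegalArgumentException("at least one mode is required");
    }
    Mode first = subModes[0];
    boolean allSame = true;
    boolean anyNeedsScores = false;
    for (Mode m : subModes) {
      allSame &= (m == first);
      anyNeedsScores |= m.needsScores;
    }
    if (allSame) {
      return first;
    }
    return anyNeedsScores ? Mode.COMPLETE : Mode.COMPLETE_NO_SCORES;
  }
}
```

Under this rule, `TOP_SCORES` combined with `COMPLETE_NO_SCORES` still yields `COMPLETE`, while `TOP_DOCS` combined with `COMPLETE_NO_SCORES` yields `COMPLETE_NO_SCORES`, so no scores are computed when no sub collector asks for them.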
Luca Cavanna 90a5a71448 fix typo in changelog 2023-01-13 14:12:06 +01:00
Lu Xugang 102622842b
Enhance XXXField#newRangeQuery (#12078)
Introduce IndexSortSortedNumericDocValuesRangeQuery to IntField#newRangeQuery and LongField#newRangeQuery
2023-01-13 18:12:26 +08:00
Adrien Grand aaab028266
Speed up DocIdMerger on sorted indexes. (#12081)
In the case when an index is sorted on a low-cardinality field, or the index
sort order correlates with the order in which documents get ingested, we can
optimize `SortedDocIDMerger` by doing a single comparison with the doc ID on
the next sub. This check covers at the same time whether the priority queue
needs reordering and whether the current sub reached `NO_MORE_DOCS`.
2023-01-12 18:27:45 +01:00
Adrien Grand 729fedcbac
Speed up 1D BKD merging. (#12079)
On the NYC taxis dataset on my local machine, switching from
`Arrays#compareUnsigned` to `ArrayUtil#getUnsignedComparator` yielded a 15%
speedup of BKD merging.
2023-01-12 18:14:15 +01:00
Benjamin Trent 59b17452aa
Fix exponential runtime for Boolean#rewrite (#12072)
When #672 was introduced, it added many nice rewrite optimizations. However, in the case when there are many nested Boolean queries under a top level Boolean#filter clause, its runtime grows exponentially.

The key issue was how the BooleanQuery#rewriteNoScoring redirected yet again to the ConstantScoreQuery#rewrite. This causes BooleanQuery#rewrite to be called again recursively, even though it was previously called in ConstantScoreQuery#rewrite, and THEN BooleanQuery#rewriteNoScoring is called again, recursively.

This causes exponential growth in rewrite time based on query depth. The change here hopes to short-circuit that and only grow (near) linearly by calling BooleanQuery#rewriteNoScoring directly, instead of attempting to redirect through ConstantScoreQuery#rewrite.

closes: #12069
2023-01-12 16:50:05 +01:00
Adrien Grand 9ab324d2be Allow reusing indexed binary fields. (#12053)
Today Lucene allows creating indexed binary fields, e.g. via
`StringField(String, BytesRef, Field.Store)`, but not reusing them: calling
`setBytesValue` on a `StringField` throws.

This commit removes the check that prevents reusing fields with binary values.
I considered an alternative that consisted of failing if calling
`setBytesValue` on a field that is indexed and tokenized, but we currently
don't have such checks e.g. on numeric values, so it did not feel consistent.

Doing this change would help improve the [nightly benchmarks for the NYC taxis
dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html)
by doing the String -> UTF-8 conversion only once for keywords, instead of once
for the `StringField` and once for the `SortedDocValuesField`, while still
reusing fields.
2023-01-12 09:52:12 +01:00
Adrien Grand 525a11091c Revert "Allow reusing indexed binary fields. (#12053)"
This reverts commit 84778549af.
2023-01-12 09:48:06 +01:00
Adrien Grand 84778549af
Allow reusing indexed binary fields. (#12053)
Today Lucene allows creating indexed binary fields, e.g. via
`StringField(String, BytesRef, Field.Store)`, but not reusing them: calling
`setBytesValue` on a `StringField` throws.

This commit removes the check that prevents reusing fields with binary values.
I considered an alternative that consisted of failing if calling
`setBytesValue` on a field that is indexed and tokenized, but we currently
don't have such checks e.g. on numeric values, so it did not feel consistent.

Doing this change would help improve the [nightly benchmarks for the NYC taxis
dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html)
by doing the String -> UTF-8 conversion only once for keywords, instead of once
for the `StringField` and once for the `SortedDocValuesField`, while still
reusing fields.
2023-01-12 09:32:13 +01:00
Patrick Zhai d3d9ab0044
Drop wrong assertion in TestBooleanQuery.testQueryMatchesCount (#12051) 2023-01-11 10:44:06 -08:00
Adrien Grand a8ef03d979
Never throttle creation of compound files. (#12070)
`ConcurrentMergeScheduler` uses the rate at which a merge writes bytes as a
proxy for CPU usage, in order to prevent merging from disrupting searches too
much. However, creating compound files is lightweight CPU-wise and does not need
throttling.

Closes #12068
2023-01-11 09:57:13 +01:00
Adrien Grand 56ec51e558
Cut over Lucene Demo from LongPoint to LongField. (#12052) 2023-01-11 09:43:43 +01:00
Benjamin Trent cc29102a24
Create new KnnByteVectorField and KnnVectorsReader#getByteVectorValues(String) (#12064) 2023-01-11 09:20:47 +01:00
Erik Pellizzon e14327288e
Documenting that IndexReaderContext#leaves() will never return a null value and remove the null checks from the method calls (#12034) 2023-01-06 14:35:52 -08:00
Uwe Schindler 5fccaec166 Remove deprecated APIs after #12066; this also removes another one missed to be removed before 2023-01-05 11:53:16 +01:00
Uwe Schindler 7f483bd618
Retire/deprecate per-instance MMapDirectory#setUseUnmap (#12066) 2023-01-04 19:17:03 +01:00
Uwe Schindler 2f602c01dd
Add a sysprop "org.apache.lucene.store.MMapDirectory.enableMemorySegments" (#12062) 2023-01-03 19:10:28 +01:00
Lu Xugang 19cc6cdf66
Out of boundary in CombinedFieldQuery#addTerm (#12046) 2023-01-03 15:33:36 +08:00
Uwe Schindler e2ee09d0c5
Fix detection of Hotspot in TestRamUsageEstimator so it works with OpenJ9 that has the bean, but without properties (#12058) 2023-01-02 00:05:36 +01:00
twosom 4676a735c1
fix typo analysis-kuromoji (#12047) 2023-01-01 10:58:50 -05:00
Patrick Zhai 4eab1d74e8
Fix TestRangeOnRangeFacetCounts dimension overflow error (#12049) 2022-12-30 13:18:51 -08:00
Greg Miller 21107d811b Add CHANGES entry for GITHUB#11869 2022-12-30 07:46:51 -08:00
Marc D'Mello cbfed77fd3
Github#11869: Add RangeOnRangeFacetCounts (#11901) 2022-12-30 07:38:13 -08:00
Adrien Grand 6f477e5831
Optimize flush of doc-value fields that are effectively single-valued when an index sort is configured. (#12037)
This iterates on #399 to also optimize the case when an index sort is
configured. When cutting over the NYC taxis benchmark to the new numeric
fields,
[flush times](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html#flush_times)
stayed mostly the same when index sorting is disabled and increased by 7-8%
when index sorting is enabled. I expect this change to address this slowdown.
2022-12-27 11:12:56 +01:00
Adrien Grand ddd63d2da3
Tune the amount of memory that is allocated to sorting postings upon flushing. (#12011)
When flushing segments that have an index sort configured, postings lists get
loaded into arrays and get reordered according to the index sort.

This reordering is implemented with `TimSorter`, a variant of merge sort. Like
merge sort, an important part of `TimSorter` consists of merging two contiguous
sorted slices of the array into a combined sorted slice. This merging can be
done either with external memory, which is the classical approach, or in-place,
which still runs in linear time but with a much higher factor. Until now we
were allocating a fixed budget of `maxDoc/64` for doing these merges with
external memory. If this is not enough, sorted slices would be merged in place.

I've been looking at some profiles recently for an index where a non-negligible
chunk of the time was spent on in-place merges. So I would like to propose the
following change:
 - Increase the maximum RAM budget to `maxDoc / 8`. This should help avoid
   in-place merges for all postings up to `docFreq = maxDoc / 4`.
 - Make this RAM budget lazily allocated, rather than eagerly like today. This
   would help not allocate memory in O(maxDoc) for fields like primary keys
   that only have a couple postings per term.

So overall memory usage would never be more than 50% higher than what it is
today, because `TimSorter` never needs more than X temporary slots if the
postings list doesn't have at least 2*X entries, and these 2*X entries already
get loaded into memory today. And for fields that have short postings, memory
usage should actually be lower.
2022-12-27 11:11:18 +01:00
Adrien Grand e9dc4f9188
Avoid sorting values of multi-valued writers if there is a single value. (#12039)
They currently call `Arrays#sort`, which incurs a tiny bit of overhead due to
range checks and some logic to determine the optimal sorting algorithm to use
depending on the number of values. We can skip this overhead in the case when
there is a single value.
2022-12-27 11:03:06 +01:00
Zach Chen 008a0d4206
Remove IOContext from Directory#openChecksumInput (#12027) 2022-12-26 11:45:42 -08:00
Uwe Schindler c9401bf064
Patch class files for Java 19 code to no longer have the "preview" flag (this enables Java 19 memory segments by default) (#12033) 2022-12-26 10:07:44 +01:00
Uwe Schindler 92f08aff9f Make childLog final to fix compilation on Java 20. This closes #12041 2022-12-25 14:55:33 +01:00
Lu Xugang 3bc8cd5c20
Aggressive `count` in BooleanWeight (#12017) 2022-12-22 23:48:05 +08:00
twosom ad22fb2879
Fix typo in AbstractQueryConfig javadocs (#12031) 2022-12-22 13:57:29 +01:00
twosom 5c78e04a17
fix typo in BaseSynonymParserTestCase (#12030)
Co-authored-by: hope <hope@gravylab.co.kr>
2022-12-21 13:28:52 -05:00
Egor Potemkin d18e3f1d45
Issue #11582 Update Faceting user guide (#12025)
Update faceting user guide to modern times.

Co-authored-by: Egor Potemkin <epotyom@amazon.com>
2022-12-21 12:20:18 -05:00
Francisco Fernández Castaño 57201aa967
Add IntField, LongField, FloatField and DoubleField (#11997)
This commit adds new IndexableFields that index both points and doc
values at once.

Closes #11199
2022-12-20 18:19:46 +01:00
Benjamin Trent 1412e559d9
Clean up KNN related backward-codecs changes (#12019) 2022-12-20 14:04:42 +01:00
Andriy Redko 945d7fe027
Upgrade ANTLR to version 4.11.1 (#12016)
Drop 3.x compatibility (which was pickier at compile-time and prevented slow things from happening). Instead add paranoia to runtime tests, so that they fail if antlr would do something slow in the parsing. This is needed because antlrv4 is a big performance trap: https://github.com/antlr/antlr4/blob/master/doc/faq/general.md

"Q: What are the main design decisions in ANTLR4?
Ease-of-use over performance. I will worry about performance later."

It allows us to move forward with newer antlr but hopefully prevent the associated headaches.

Signed-off-by: Andriy Redko <andriy.redko@aiven.io>
Co-authored-by: Robert Muir <rmuir@apache.org>
2022-12-15 22:40:35 -05:00
Craig Taverner 3e8ef57e3f
Fix flat polygons incorrectly containing intersecting geometries (#12022) 2022-12-15 14:56:09 +01:00
Benjamin Trent 11f2bc2056
Fix SimpleTextKnnVectorsReader to handle changes introduced in GITHUB#12004 (#12024) 2022-12-15 14:49:47 +01:00
Benjamin Trent 72968d30ba
Move byte vector queries into new KnnByteVectorQuery (#12004) 2022-12-14 09:53:10 +01:00
Robert Muir 9eeab8c4a6
Remove deprecated API in 10.x (#11998) 2022-12-13 10:32:15 -05:00
Robert Muir 47f8c1baa2
Migrate away from per-segment-per-threadlocals on SegmentReader (#11998)
Add new stored fields and termvectors interfaces: IndexReader.storedFields()
and IndexReader.termVectors(). Deprecate IndexReader.document() and IndexReader.getTermVector().
The new APIs do not rely upon ThreadLocal storage for each index segment, which can greatly
reduce RAM requirements when there are many threads and/or segments.

Co-authored-by: Adrien Grand <jpountz@gmail.com>
2022-12-13 09:10:21 -05:00
Ignacio Vera ef5766aa81
Fix algorithm that chooses the bridge between a polygon and a hole (#11988) 2022-12-13 10:16:53 +01:00
Robert Muir 06f9179295
Enable LongDoubleConversion error-prone check (#12010) 2022-12-12 20:55:39 -05:00
Greg Miller e34234ca6c
Remove unnecessary NaN checks from LongRange#verifyAndEncode (#12008) 2022-12-11 12:55:21 -08:00
Greg Miller 8671e29929
Some minor code cleanup in IndexSortSortedNumericDocValuesRangeQuery (#12003)
* Leverage DISI static factory methods more over custom DISI impl where possible.
* Assert points field is a single-dim.
* Bound cost estimate by the cost of the doc values field (for sparse fields).
2022-12-10 12:23:31 -08:00
gf2121 54e00df7f6
Do int compare instead of ArrayUtil#compareUnsigned4 in LatlonPointQueries (#12006) 2022-12-11 02:30:17 +08:00
gf2121 9ff989ec00
Use ByteArrayComparator to replace Arrays#compareUnsigned in some other places (#11880) 2022-12-08 23:51:08 +08:00
Alan Woodward 66127f6e69
Add support for stored fields to MemoryIndex (#11999) 2022-12-08 09:56:24 +00:00
Adrien Grand a971120d05
Make RandomAccessVectorValues an implementation detail of HNSW implementations rather than a proper API. (#11964)
`RandomAccessVectorValues` is internally used in our HNSW implementation to
provide random access to vectors, both at index and search time. In order to
better reflect this, this change does the following:
 - `RandomAccessVectorValues` moves to `org.apache.lucene.util.hnsw`.
 - `BufferingKnnVectorsWriter` no longer has a dependency on
   `RandomAccessVectorValues` and moves to `org.apache.lucene.codecs` since
   it's more of a utility class for KNN vector file formats than an index API.
   Maybe we should think of moving it near each file format that uses it
   instead.
 - `SortingCodecReader` no longer has a dependency on
   `RandomAccessVectorValues`.

Closes #10623
2022-12-08 08:49:37 +01:00
Adrien Grand 95df7e8109
Generalize range query optimization on sorted indexes to descending sorts. (#11972)
This generalizes #687 to indexes that are sorted in descending order. The main
challenge with descending sorts is that they require being able to compute the
last doc ID that matches a value, which would ideally require walking the BKD
tree in reverse order, but the API only supports moving forward. This is worked
around by maintaining a stack of `PointTree` clones to perform the search.
2022-12-08 08:38:53 +01:00
Benjamin Trent d0be9ab57c
GITHUB-11830 Better optimize storage for vector connections (#11860) 2022-12-07 08:51:54 +01:00
Karl David Wright 108462a005 Followup work for #11883 2022-12-03 08:07:10 -05:00
Costin Leau 4eba6a1284
Add exponential growth to TimeLimitingBulkScorer (#11984)
Increase the interval between timeout checks inside TimeLimitingBulkScorer at an exponential rate.

Fix #11676
2022-12-02 09:20:48 -08:00
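A rough sketch of the idea behind the commit above: if the number of documents scored between timeout checks grows exponentially, the number of checks grows only logarithmically with the number of documents. `CheckSchedule` is a hypothetical class, and the doubling factor is an assumption made for illustration; the growth rate actually used by `TimeLimitingBulkScorer` is not stated in the commit message.

```java
/** Hypothetical model of how often a timeout would be checked while scoring. */
final class CheckSchedule {
  private CheckSchedule() {}

  /**
   * Counts the timeout checks needed to cover totalDocs when one check is made
   * before each scoring window. With exponential growth the window doubles
   * after every check (assumed factor); otherwise it stays fixed.
   */
  static int countChecks(long totalDocs, long initialInterval, boolean exponential) {
    if (initialInterval <= 0) {
      throw new IllegalArgumentException("initialInterval must be positive");
    }
    int checks = 0;
    long interval = initialInterval;
    long scored = 0;
    while (scored < totalDocs) {
      checks++; // one timeout check per window
      scored += interval;
      if (exponential) {
        interval *= 2; // assumption: doubling, chosen only for the sketch
      }
    }
    return checks;
  }
}
```

With 1000 documents and an initial window of 100, a fixed interval needs 10 checks, while the doubling schedule covers the same range in 4 (windows of 100, 200, 400 and 800 documents).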
Robert Muir fad3108b27
fix wrong serialization by ShapeDocValues (#11974)
Closes #11973
2022-12-01 20:32:42 -05:00
Alan Woodward 72ff140f5a
Don't let merged passages push out lower-scoring ones (#11990)
PassageScorer uses a priority queue of size maxPassages to keep track of
which highlighted passages are worth returning to the user. Once all
passages have been collected, we go through and merge overlapping
passages together, but this reduction in the number of passages is not
compensated for by re-adding the highest-scoring passages that were pushed
out of the queue by passages which have been merged away.

This commit increases the size of the priority queue to try and account for
overlapping passages that will subsequently be merged together.
2022-12-01 12:25:29 +00:00
Luca Cavanna bd168ac2a8 Add changes entry for #11985 2022-11-30 10:13:39 +01:00
Luca Cavanna 343d888b30
ExitableTerms to override getMin and getMax (#11985)
ExitableTerms should not iterate through the terms to retrieve min and max when the wrapped implementation has the values cached (e.g. FieldsReader, OrdsFieldReader)
2022-11-30 10:06:31 +01:00
Alan Woodward 0cc6f69536
Give OffsetsRetrievalStrategy implementations public constructors (#11983)
OffsetsFromMatchIterator and OffsetsFromPositions both have package-
private constructors, which makes them difficult to use as components in a
separate highlighter implementation.
2022-11-28 16:22:46 +00:00
Karl David Wright 5c4896321d Merge branch 'GITHUB-11883' into main
Pulling in changes to address ticket 11883.
2022-11-25 16:32:02 -05:00
Karl David Wright 74e8b94796 Fix for 11883. 2022-11-25 16:17:18 -05:00
Karl David Wright 6dc6b5b0dd As part of GITHUB-11883, develop new primitive Plane constructors to build boundary planes specific for each polygon edge. 2022-11-25 14:56:38 -05:00
Greg Miller 2e83c3b40f
Fix NPE in BinaryRangeFieldRangeQuery when field does not exist or is of wrong type (#11950) 2022-11-25 11:38:41 -08:00
Robert Muir 4e93f29318
fix bad shift amounts and enable check (#11979) 2022-11-25 11:47:25 -05:00
Robert Muir 545c93a394
fix use of wrong array toString() method in test, enable check (#11978) 2022-11-25 11:47:04 -05:00
Robert Muir 4885b5f856
fix use of wrong array equals() method in test, enable check (#11977) 2022-11-25 11:46:48 -05:00
Robert Muir f4286493d1
fix variable assigned to itself in test and enable check (#11980) 2022-11-25 11:45:45 -05:00
Karl David Wright b5f94b6754 Add test that tweaks identical planes in intersections bug 2022-11-25 07:40:45 -05:00
Karl David Wright b5dd71198d Refactor, restoring isWithinSection and making sure it is properly called. 2022-11-24 02:47:06 -05:00
Shubham Chaudhary b15ace46b2
Remove QueryTimeout#isTimeoutEnabled method and move check to caller (#11954)
Co-authored-by: Shubham <cshbha@amazon.com>
2022-11-24 16:37:20 +01:00
Adrien Grand 28576eb99d Fix precommit. 2022-11-24 11:44:21 +01:00
Simon Cooper 135f3fab41
Ensure collections are properly sized on creation (#11942)
A few other optimisations along the way
2022-11-24 11:20:04 +01:00
Karl David Wright 839dfb5a2d More refactoring work, and fix a distance calculation. 2022-11-23 23:36:15 -05:00
Karl David Wright 5e4623af1f For 11965, add structural changes that would allow intersection calls to also be O(log(n)). Disabled though because test failures are the result of enabling it - work ongoing. 2022-11-23 15:07:57 -05:00
Karl David Wright 482f8251ff More work related to 11965: Improve performance of nearestDistance queries somewhat by removing unnecessary code. 2022-11-23 12:21:38 -05:00
Adrien Grand 802774641a
Enforce VectorValues.cost() is equal to size(). (#11962)
`VectorValues` have a `cost()` method that reports an approximate number of
documents that have a vector, but also a `size()` method that reports the
accurate number of vectors in the field. Since KNN vectors only support
single-valued fields we should enforce that `cost()` returns the `size()`.
2022-11-23 11:05:00 +01:00
Adrien Grand 20c1ba5d9a
Remove VectorValues#EMPTY. (#11961)
This instance is illegal as it reports a number of dimensions equal to zero.
2022-11-23 10:52:12 +01:00
Adrien Grand 8bdc59ce67 Add back-compat indices for 9.4.2 2022-11-23 10:35:06 +01:00
Dawid Weiss 30873cfcd9 Fix the boxing issue again. 2022-11-23 08:29:12 +01:00
Karl David Wright 5fec8efe4e More tidying to make lint happy 2022-11-22 22:05:55 -05:00
Karl David Wright 49c8a75917 Resolve merge conflicts 2022-11-22 21:29:06 -05:00
Karl David Wright 0593eca73d Fix problem in new Plane code 2022-11-22 21:18:11 -05:00
Karl David Wright fc7ce76851 Refactor and make hierarchical GeoStandardPath. Some tests fail and will need to be researched further. 2022-11-22 21:18:11 -05:00
Karl David Wright 1ded41ea20 Final bugs fixed, except remaining legacy issue with nearest distance in GeoDegeneratePath. 2022-11-22 21:12:44 -05:00
Karl David Wright c9c27c755a Make sure use of aggregation form is consistent throughout, and fix segment endpoint computations of nearestDistance. 2022-11-22 18:43:19 -05:00
Karl David Wright ae5179986c All tests fixed save two - distance related. 2022-11-22 16:56:34 -05:00
Adrien Grand 7b7cb396e5 Tidy. 2022-11-22 18:57:21 +01:00
Adrien Grand 750e7dba32 Add bugfix version 9.4.2 2022-11-22 18:56:30 +01:00
Karl David Wright 799421abba Fix nearestDistance for real this time 2022-11-22 12:55:18 -05:00
Peter Gromov 2ae8dd632e
hunspell: support empty dictionaries, adapt to the hunspell/C++ repo changes (#11960)
hunspell: support empty dictionaries, adapt to the hunspell/C++ repo changes
2022-11-22 18:23:45 +01:00
Mike McCandless ad04ac1bc4 tidy up 2022-11-22 08:20:30 -05:00
Mike McCandless 4bd4f8b521 remove unused imports 2022-11-22 08:18:41 -05:00
Mike McCandless acbc08fb32 expand wildcard imports 2022-11-22 08:03:54 -05:00
Mike McCandless 521c0e24f2 #10878: revert #02528c6757d10420cc7d545282b49c4322943ac7 (add some test verbosity on failure (#11935)) 2022-11-22 07:59:08 -05:00
Stefan Vodita 369a70f289
Support deletions in rearrange (#11815)
* Support deletions in rearrange
* Store BinaryDocValues in the binary doc value selector as BytesRef
   instead of String.
2022-11-21 23:52:38 -08:00
Karl David Wright 5d341e9d8c Cleanup StandardGeoPath to get rid of unused member arrays 2022-11-21 20:36:25 -05:00
Karl David Wright ecf4396ef7 Remove a dated test 2022-11-21 20:01:30 -05:00
Karl David Wright d85c35e4d7 Fix problem in new Plane code 2022-11-21 19:15:10 -05:00
Karl David Wright 20fcd0b757 Refactor and make hierarchical GeoStandardPath. Some tests fail and will need to be researched further. 2022-11-21 18:39:50 -05:00
Alan Woodward 332679886c
Add field as a separate input to newSynonymQuery (#11941)
QueryBuilder#newSynonymQuery takes an array of TermAndBoost objects as a
parameter and uses the field of the first term in the array as its field. However,
there are cases where this array may be empty, which will result in an
ArrayIndexOutOfBoundsException.

This commit reworks QueryBuilder so that TermAndBoost contains plain
BytesRefs, and passes the field as a separate parameter. This guards against
accidental calls to newSynonymQuery with an empty list - in this case, an
empty synonym query is generated rather than an exception. It also
refactors SynonymQuery itself to hold BytesRefs rather than Terms, which
needlessly repeat the field for each entry.

Fixes #11864
2022-11-21 09:55:14 +00:00
Karl David Wright cd82a9bbdc Revert the change that relies on accurate bounds from path components. This caused randomized test failures, and fixing the bounds caused other (inexplicable) test failures. More research needed. 2022-11-20 10:32:14 -05:00
Robert Muir 8fdac2d88e
fixed boxed equality check to unbreak the build: has this code been tested???? 2022-11-19 23:11:30 -05:00
Karl David Wright 1e236090af Fix up formatting 2022-11-19 17:56:10 -05:00
Karl David Wright 9bca7a70e1 Fix longstanding bug in path bounds calculation, and hook up efficient isWithin() and distance logic 2022-11-19 17:56:09 -05:00
Karl David Wright fbdb655221 Add node structures and fast operations for them. 2022-11-19 17:56:08 -05:00
Robert Muir 62f2b42502
Prevent TestStressIndexing from taking minutes for normal non-NIGHTLY runs (#11953)
This test intentionally does a ton of filesystem operations: currently
about 20% of the time you can get really unlucky and get virus checker
simulated, against a real filesystem, which makes things really slow.

Instead use a ByteBuffersDirectory for local runs so that it doesn't
take minutes. The test can still be pretty slow even with this
implementation, so tone down the runtime so that it takes ~ 1.5s
locally.
2022-11-19 18:06:52 -05:00
Dawid Weiss 3f6410b738
Implement source code regeneration for test-framework perl scripts (#11952) 2022-11-19 23:40:45 +01:00