Commit Graph

36608 Commits

Author SHA1 Message Date
Matthias Osswald 3af0c6872a Add Option to Set Subtoken Position Increment for Dictionary Decompounder
This pull request adds a new feature to Lucene's DictionaryDecompounder. Now, you can set the position increment of subtokens to one. This feature is required when you're doing AND searches that involve subtokens.

Currently, DictionaryDecompounder always emits subtokens with a position increment of zero. With this update, users can set subtokenPositionIncrement to one, which changes the position increment of the subtokens to one. As a result, if you use the AND operator in an Elasticsearch match clause to search for 'orangenschokolade', and 'orangen' and 'schokolade' are in your dictionary, it will correctly search for 'orangen AND schokolade'.

By default, the DictionaryDecompounder emits the original compounded token. This behavior remains unchanged when the flag is set to zero. However, when set to one, it changes the DictionaryDecompounder's output to individual subtokens, and the original compounded token will not be emitted.
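
The effect of the flag on token positions can be sketched in plain Java. This is only an illustration of the behavior described above, not Lucene's implementation: `SubtokenPositionSketch`, `Token`, `decompound` and `describe` are hypothetical names, and the dictionary lookup is skipped by passing the subtokens in directly.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; this is not Lucene's DictionaryDecompounder. */
final class SubtokenPositionSketch {

  /** A token together with its position increment relative to the previous token. */
  static final class Token {
    final String term;
    final int positionIncrement;

    Token(String term, int positionIncrement) {
      this.term = term;
      this.positionIncrement = positionIncrement;
    }
  }

  /** Emits tokens as the commit message describes; subtokens are assumed to be known. */
  static List<Token> decompound(
      String compound, List<String> subtokens, int subtokenPositionIncrement) {
    List<Token> out = new ArrayList<>();
    if (subtokenPositionIncrement == 0) {
      // Default behavior: keep the compound and stack the subtokens on its position.
      out.add(new Token(compound, 1));
    }
    for (String subtoken : subtokens) {
      out.add(new Token(subtoken, subtokenPositionIncrement));
    }
    return out;
  }

  /** Formats each token as term@position; positions are running sums of increments. */
  static String describe(List<Token> tokens) {
    StringBuilder sb = new StringBuilder();
    int position = -1; // a first increment of 1 then lands on position 0
    for (Token token : tokens) {
      position += token.positionIncrement;
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(token.term).append('@').append(position);
    }
    return sb.toString();
  }
}
```

With the flag at zero all three tokens share position 0, while with the flag at one the two subtokens occupy consecutive positions and the compound is absent.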
2023-07-31 15:50:39 +02:00
Adrien Grand 5e725964a0
Improve MaxScoreBulkScorer partitioning logic. (#12457)
Partitioning scorers is an optimization problem: the optimal set of
non-essential scorers is the subset of scorers whose sum of max window scores
is less than the minimum competitive score that maximizes the sum of costs.

The current approach consists of sorting scorers by maximum score within the
window and computing the set of non-essential clauses as the first scorers
whose sum of max scores is less than the minimum competitive score, i.e. you
cannot have a competitive hit by matching only non-essential clauses.

This sorting logic works well in the common case when costs are inversely
correlated with maximum scores and gives an optimal solution: the above
algorithm will also optimize the cost of non-essential clauses and thus
minimize the cost of essential clauses, in turn further improving query
runtimes. But this isn't true for all queries. E.g. fuzzy queries compute
scores based on artificial term statistics, so costs are no longer inversely
correlated with maximum scores. This was especially visible with the query
`titel~2` on the Wikipedia dataset, as `title` matches this query and is a
high-frequency term. Yet the score contribution of this term is of the same
order as the contribution of most other terms, so query runtime improves
considerably if this clause is considered non-essential rather than essential.

This commit optimizes the partitioning logic a bit by sorting clauses by
`max_score / cost` instead of just `max_score`. This will not change anything
in the common case when max scores are inversely correlated with costs, but can
significantly help otherwise. E.g. `titel~2` went from 41ms to 13ms on my
machine and the wikimedium10m dataset with this change.
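
The difference between the two sort orders can be sketched in plain Java. This is a simplified model, not the actual `MaxScoreBulkScorer` code: `PartitionSketch`, `Clause` and `nonEssential` are hypothetical names, and the scores and costs in `sample()` are made-up numbers chosen so that one expensive clause has an unremarkable max score.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical model of the partitioning step; not Lucene code. */
final class PartitionSketch {

  static final class Clause {
    final String name;
    final double maxScore; // maximum score within the current window
    final long cost;       // rough number of documents the clause matches

    Clause(String name, double maxScore, long cost) {
      this.name = name;
      this.maxScore = maxScore;
      this.cost = cost;
    }
  }

  /** Made-up clauses: A is very costly yet scores about the same as B and C. */
  static List<Clause> sample() {
    List<Clause> clauses = new ArrayList<>();
    clauses.add(new Clause("A", 1.0, 1_000_000L));
    clauses.add(new Clause("B", 0.9, 10L));
    clauses.add(new Clause("C", 0.95, 20L));
    return clauses;
  }

  /**
   * Sorts the clauses, then takes the longest prefix whose summed max scores stay
   * below the minimum competitive score. Those clauses are non-essential.
   */
  static List<String> nonEssential(
      List<Clause> clauses, double minCompetitiveScore, boolean byScorePerCost) {
    Comparator<Clause> order =
        byScorePerCost
            ? Comparator.comparingDouble((Clause c) -> c.maxScore / c.cost)
            : Comparator.comparingDouble((Clause c) -> c.maxScore);
    List<Clause> sorted = new ArrayList<>(clauses);
    sorted.sort(order);
    List<String> names = new ArrayList<>();
    double sum = 0;
    for (Clause clause : sorted) {
      if (sum + clause.maxScore >= minCompetitiveScore) {
        break; // adding this clause could produce a competitive hit on its own side
      }
      sum += clause.maxScore;
      names.add(clause.name);
    }
    return names;
  }
}
```

With a minimum competitive score of 1.5, sorting by max score alone makes only B non-essential and leaves the costly clause A essential, whereas sorting by `max_score / cost` makes A non-essential and leaves only the cheap clauses B and C to drive iteration.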
2023-07-29 21:02:28 +02:00
Peter Gromov 32ec38271e
hunspell: check for aff file well-formedness more strictly (#12468)
* hunspell: check for .aff file well-formedness more strictly
2023-07-29 16:07:50 +02:00
Peter Gromov 1af68bf2d7
hunspell: make the hash table load factor customizable (#12464)
* hunspell: make the hash table load factor customizable
2023-07-28 18:36:32 +02:00
Mayya Sharipova 155b2edbe3
Fix occasional failure in BaseKnnVectorsFormatTestCase#testIllegalDimensionTooLarge (#12467)
Depending on whether a document with dimensions > maxDims is created
in a new segment or an already existing segment, we may get
different error messages. This fix adds another possible
error message we may get.

Relates to #12436
2023-07-28 09:37:16 -04:00
Mayya Sharipova 119635ad80
Make KnnVectorsFormat#getMaxDimensions abstract (#12466)
- Backward codecs use 1024 as max dims
- Test classes use the current KnnVectorsFormat#DEFAULT_MAX_DIMENSIONS

Relates to PR#12436
Closes #12309
2023-07-28 08:34:17 -04:00
Mayya Sharipova 98320d7616
Move max vector dims limit to Codec (#12436)
Move vector max dimension limits enforcement into the default Codec's
KnnVectorsFormat implementation. This allows different implementations
of knn search algorithms to define their own limits on the maximum
vector dimensions that they can handle.

Closes #12309
2023-07-27 14:50:33 -04:00
Armin Braun 538b7d0ffe
Clean up writing String to ByteBuffersDataOutput (#12455)
Resolving TODO to use UnicodeUtil instead of a copy of its code here.
Maybe slightly slower from the extra check for high-surrogate but that
may be outweighed by more compact code and by saving the capturing lambda
that might not inline.
2023-07-26 14:25:01 +02:00
Greg Miller 87944c2aa7 Move CHANGES entry for GITHUB#12408 under 10.0
A backport to 9.x will be somewhat tricky with the API surface area
so planning to wait for a 10.0 release.
2023-07-25 13:02:36 -07:00
Greg Miller 179b45bc23
Initialize facet counting data structures lazily (#12408)
This change covers:
* Taxonomy faceting
  * FastTaxonomyFacetCounts
  * TaxonomyFacetIntAssociations
  * TaxonomyFacetFloatAssociations
* SSDV faceting
  * SortedSetDocValuesFacetCounts
  * ConcurrentSortedSetDocValuesFacetCounts
  * StringValueFacetCounts
* Range faceting:
  * LongRangeFacetCounts
  * DoubleRangeFacetCounts
* Long faceting:
  * LongValueFacetCounts

Left for a future iteration:
* RangeOnRange faceting
* FacetSet faceting
2023-07-25 12:20:42 -07:00
Greg Miller 2b3b028734
GITHUB#12451: Update TestStringsToAutomaton validation to work around GH#12458 (#12461) 2023-07-25 11:56:18 -07:00
Armin Braun 20e97fbd00
Faster bulk numeric reads from BufferedIndexInput (#12453)
Reading ints/floats/longs one-by-one from a heap-byte-buffer, including
doing our own bounds checks, is not very efficient. We can use the
ability to translate the buffer and read in bulk while taking turns with
one-off reading/refilling instead.
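
As a rough illustration of the idea using only the JDK, a `ByteBuffer` can be viewed as a `LongBuffer` and drained into an array in one call. This is a sketch, not the code from the commit: `BulkReadSketch` and its methods are hypothetical names, and the refilling that `BufferedIndexInput` has to do when its buffer runs out is left out.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical comparison of one-by-one and bulk reads from a heap ByteBuffer. */
final class BulkReadSketch {

  /** Reads each value with its own call, paying a bounds check every time. */
  static long[] readOneByOne(ByteBuffer buffer, int count) {
    long[] dst = new long[count];
    for (int i = 0; i < count; i++) {
      dst[i] = buffer.getLong();
    }
    return dst;
  }

  /** Reads all values with a single bulk get on a LongBuffer view. */
  static long[] readBulk(ByteBuffer buffer, int count) {
    long[] dst = new long[count];
    // The view starts at the current position and inherits the byte order.
    buffer.asLongBuffer().get(dst, 0, count);
    // The view keeps its own position, so advance the byte buffer manually.
    buffer.position(buffer.position() + count * Long.BYTES);
    return dst;
  }

  /** Builds a small little-endian buffer holding 1000, 2000, 3000, 4000. */
  static ByteBuffer sample() {
    ByteBuffer buffer = ByteBuffer.allocate(4 * Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    for (long v = 1; v <= 4; v++) {
      buffer.putLong(v * 1000);
    }
    buffer.flip();
    return buffer;
  }
}
```

Both methods return the same values and leave the buffer at the same position; the bulk variant throws `BufferUnderflowException` if fewer than `count` longs remain, which is where a real implementation would fall back to refilling.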
2023-07-24 17:20:32 +02:00
Houston Putman 1c70c40082
Enable search for site javadocs (#12430) 2023-07-24 10:38:19 -04:00
Stefan Vodita 34721f9439
Assert IdxOrDvQuery subqueries and document useful fields (#12442) 2023-07-24 16:36:48 +02:00
Benjamin Trent 59c56a0aed
Fix sorted&unsorted graph test flakiness (#12452)
When running HnswGraphTestCase#testSortedAndUnsortedIndicesReturnSameResults, we search two separate graph structures. These structures can change depending on the order of the vectors seen and consequently a different result set could be returned from the same query.

To account for this, the test had a higher number of exploration candidates (ef_search/k) of 50, but with one particular seed, C8AAF5E4648B4226, it failed.

I have verified that bumping the search candidate pool to 60 fixes the failure.

The total number of vectors still outnumbers the requested number of candidates, so the search is still hitting the graph.

I verified further by running the test again over a couple thousand seeds and it didn't fail again.
2023-07-20 14:08:21 -04:00
Adrien Grand 17c13a76c8
Add BS1 optimization to MaxScoreBulkScorer. (#12444)
Lucene's scorers that can dynamically prune on score provide great speedups
when they manage to skip many hits. Unfortunately, there are also cases when
they cannot skip hits efficiently, one example case being when there are many
clauses in the query. In this case, exhaustively evaluating the set of matches
with `BooleanScorer` (BS1) may perform several times faster.

This commit adds to `MaxScoreBulkScorer` the BS1 optimization that consists of
collecting hits into a bitset to save the overhead of reordering priority
queues. This helps make performance degrade much more gracefully when dynamic
pruning cannot help much.
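
The bitset idea can be sketched in plain Java as follows. This is a simplified model under assumptions, not the code added to `MaxScoreBulkScorer`: `WindowBitsetSketch` and `scoreWindow` are hypothetical names, clauses are modeled as sorted arrays of doc IDs with one constant score each, and hits are returned as strings for easy inspection.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical sketch of collecting a window of hits into a bitset. */
final class WindowBitsetSketch {

  /**
   * Scores all docs in [windowMin, windowMax). Each clause is a sorted array of
   * doc IDs with a constant score. Returns "doc:score" entries in doc ID order.
   */
  static List<String> scoreWindow(
      int[][] clauseDocs, double[] clauseScores, int windowMin, int windowMax) {
    int size = windowMax - windowMin;
    BitSet matching = new BitSet(size);
    double[] scores = new double[size];
    // Visit clauses one after another; no priority queue has to be reordered.
    for (int c = 0; c < clauseDocs.length; c++) {
      for (int doc : clauseDocs[c]) {
        if (doc < windowMin) {
          continue;
        }
        if (doc >= windowMax) {
          break;
        }
        int slot = doc - windowMin;
        matching.set(slot);
        scores[slot] += clauseScores[c];
      }
    }
    // Replay the set bits in increasing order to emit hits in doc ID order.
    List<String> hits = new ArrayList<>();
    for (int slot = matching.nextSetBit(0); slot >= 0; slot = matching.nextSetBit(slot + 1)) {
      hits.add((windowMin + slot) + ":" + scores[slot]);
    }
    return hits;
  }
}
```

The accumulation loop touches every match of every clause, which is exhaustive, but each step is a cheap array update rather than a heap operation.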

Closes #12439
2023-07-19 13:51:22 +02:00
Martin Demberger 55f2f9958b
LUCENE-8183: Added the ability to get noSubMatches and noOverlapping Matches (#12437)
---------

Co-authored-by: Martin Demberger <martin.demberger@root-nine.de>
2023-07-19 13:15:33 +02:00
Peter Gromov f05adff4ca
hunspell: speed up the dictionary enumeration (#12447)
* hunspell: speed up the dictionary enumeration

cache each word's case and the lowercase form
group the words by lengths to avoid even visiting entries with unneeded lengths
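
A minimal sketch of the grouping idea, assuming nothing about the real hunspell data structures: `LengthBucketsSketch` and `wordsWithLength` are hypothetical names, and only the lowercase form is cached here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

/** Hypothetical sketch: bucket dictionary words by length for faster enumeration. */
final class LengthBucketsSketch {

  // Maps a word length to the lowercase forms of all words of that length.
  private final TreeMap<Integer, List<String>> byLength = new TreeMap<>();

  LengthBucketsSketch(List<String> words) {
    for (String word : words) {
      // Lowercase once at build time instead of on every enumeration.
      byLength
          .computeIfAbsent(word.length(), k -> new ArrayList<>())
          .add(word.toLowerCase(Locale.ROOT));
    }
  }

  /** Returns only the words whose length is in [minLength, maxLength]. */
  List<String> wordsWithLength(int minLength, int maxLength) {
    List<String> result = new ArrayList<>();
    // Buckets outside the range are never visited.
    for (List<String> bucket : byLength.subMap(minLength, true, maxLength, true).values()) {
      result.addAll(bucket);
    }
    return result;
  }
}
```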
2023-07-18 21:25:26 +02:00
Stefan Vodita b4619d87ed
Move sliced int buffer functionality to MemoryIndex (#11248) (#12409)
* [WIP] Move IntBlockPool slices to MemoryIndex

* [WIP] Working TestMemoryIndex

* [WIP] Working TestSlicedIntBlockPool

* Working many allocations tests

* Add basic IntBlockPool test

* SlicedIntBlockPool inherits from IntBlockPool

* Tidy
2023-07-10 10:18:28 -04:00
Benjamin Trent d03c8f16d9
Have byte[] vectors also trigger a timeout in ExitableDirectoryReader (#12423)
`ExitableDirectoryReader` did not wrap searching for `byte[]` vectors. Consequently timeouts were not respected with this reader when searching with `byte[]` vectors.

This commit fixes that bug.
2023-07-07 12:29:55 -04:00
Benjamin Trent 861153020a
Fix HNSW graph visitation limit bug (#12413)
We have some weird behavior in HNSW searcher when finding the candidate entry point for the zeroth layer.

While trying to find the best entry point to gather the full candidate set, we don't filter based on the acceptableOrds bitset. Consequently, if we exit the search early (before hitting the zeroth layer), the results that are returned may contain documents NOT within that bitset.

Luckily since the results are marked as incomplete, the *VectorQuery logic switches back to an exact scan and throws away the results.

However, if any user called the leaf searcher directly, bypassing the query, they could run into this bug.
2023-07-06 15:46:36 -04:00
Adrien Grand f527eb3b12
Remove Scorable#docID. (#12407)
`Scorable#docID()` exposes the document that is being collected, which makes it
impossible to bulk-collect multiple documents at once.

Relates to #12358
2023-07-05 10:40:06 +02:00
Uwe Schindler 9ffc625b2e Followup on #12410: Fix caller class check to use string literals to allow private/pkg-private classes 2023-07-03 17:52:10 +02:00
Uwe Schindler f668cfd1cd
Fix forUtil.gradle to actually execute python script and also fix type error in script (#12411) 2023-07-03 16:22:03 +02:00
Uwe Schindler fde2e50d9e
Refactor vectorization support in Lucene (#12410) 2023-07-03 11:39:19 +02:00
Perdjesk 907883f701
Correct Javadocs using SimpleBindings (#12402)
Javadocs were still referring to the old `SortField` API, which
has been replaced with methods that use `DoubleValuesSource`
instead.
2023-06-30 15:14:24 +01:00
Adrien Grand 8811f31b9c
Add a post-collection hook to LeafCollector. (#12380)
This adds `LeafCollector#finish` as a per-segment post-collection hook. While
it was already possible to do this sort of thing on top of the collector API
before, a downside is that the last leaf would need to be post-collected in the
current thread instead of using the executor, which is a missed opportunity for
making queries concurrent.
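
Why a per-segment hook helps concurrency can be sketched with the JDK alone. The interface below only mimics the shape of the idea: `FinishHookSketch`, `SegmentCollector` and `countAll` are hypothetical names rather than Lucene's API, and segments are modeled as plain arrays of doc IDs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of a per-segment post-collection hook. */
final class FinishHookSketch {

  /** Stand-in for a leaf collector with a post-collection hook. */
  interface SegmentCollector {
    void collect(int doc);

    void finish();
  }

  /** Counts docs across segments; each segment is finished on its worker thread. */
  static long countAll(int[][] segments) {
    AtomicLong total = new AtomicLong();
    ExecutorService executor = Executors.newFixedThreadPool(2);
    try {
      List<Callable<Void>> tasks = new ArrayList<>();
      for (int[] segment : segments) {
        tasks.add(
            () -> {
              SegmentCollector collector =
                  new SegmentCollector() {
                    long local; // per-segment state, no synchronization needed

                    public void collect(int doc) {
                      local++;
                    }

                    public void finish() {
                      total.addAndGet(local); // merge once per segment
                    }
                  };
              for (int doc : segment) {
                collector.collect(doc);
              }
              collector.finish(); // runs inside the task, not on the caller thread
              return null;
            });
      }
      for (Future<Void> future : executor.invokeAll(tasks)) {
        future.get();
      }
      return total.get();
    } catch (InterruptedException | ExecutionException e) {
      throw new RuntimeException(e);
    } finally {
      executor.shutdown();
    }
  }
}
```

Because `finish()` is called as the last step of each task, the post-collection work for every segment, including the last one, runs on the executor.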
2023-06-30 15:19:35 +02:00
Sagar 40ee6e583e
Assign a dummy simScorer in TermsWeight if score is not needed (#12383) 2023-06-30 15:14:33 +02:00
Sorabh 223eecca33
Add a thread safe CachingLeafSlicesSupplier to compute and cache the LeafSlices used with concurrent segment (#12374)
search. It uses the protected method `slices` by default to compute the slices, which can be
overridden by subclasses of IndexSearcher
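
One common way to build a thread-safe, compute-once supplier is double-checked locking on a volatile field. The sketch below shows that general pattern only; `CachingSupplierSketch` is a hypothetical name, it is not the `CachingLeafSlicesSupplier` from this commit, and it assumes the delegate never returns null.

```java
import java.util.function.Supplier;

/** Hypothetical compute-once supplier using double-checked locking. */
final class CachingSupplierSketch<T> implements Supplier<T> {

  private final Supplier<T> delegate;
  private volatile T cached; // volatile so a published value is visible to all threads

  CachingSupplierSketch(Supplier<T> delegate) {
    this.delegate = delegate;
  }

  @Override
  public T get() {
    T value = cached;
    if (value == null) {
      synchronized (this) {
        value = cached; // re-check: another thread may have computed it meanwhile
        if (value == null) {
          value = delegate.get(); // assumed to return a non-null result
          cached = value;
        }
      }
    }
    return value;
  }
}
```

After the first call, readers take the fast path with a single volatile read and never enter the synchronized block.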
2023-06-30 14:57:34 +02:00
zhangchao 01200b5804
Speed up NumericDocValuesWriter with index sorting (#12381) 2023-06-30 14:56:56 +02:00
Uwe Schindler e503805758
Remove usage and add some legacy java.util classes to forbiddenapis (Stack, Hashtable, Vector) (#12404) 2023-06-29 16:56:41 +02:00
Luca Cavanna f44cc45cf8
Share concurrent execution code into TaskExecutor (#12398)
Lucene has a non-public SliceExecutor abstraction that handles the execution of tasks when search
is executed concurrently across leaf slices. Knn query vector rewrite has similar code that runs
tasks concurrently, waits for them to be completed and handles
any exceptions.

This commit shares code among these two scenarios, to reduce code
duplication as well as to ensure that further improvements can be shared among them.
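
The shared pattern, submit everything, wait for everything, then surface a failure, can be sketched with the JDK's `ExecutorService`. This is a generic sketch and not Lucene's `TaskExecutor`: `TaskRunnerSketch`, `invokeAll` and `squares` are hypothetical names, and wrapping the first failure in a `RuntimeException` is a simplification chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper that runs tasks concurrently and waits for all of them. */
final class TaskRunnerSketch {

  /** Returns results in submission order; rethrows the first failure after waiting. */
  static <T> List<T> invokeAll(ExecutorService executor, List<Callable<T>> tasks) {
    List<Future<T>> futures = new ArrayList<>();
    for (Callable<T> task : tasks) {
      futures.add(executor.submit(task));
    }
    List<T> results = new ArrayList<>();
    RuntimeException failure = null;
    for (Future<T> future : futures) {
      try {
        results.add(future.get());
      } catch (ExecutionException e) {
        if (failure == null) {
          failure = new RuntimeException(e.getCause());
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt status
        if (failure == null) {
          failure = new RuntimeException(e);
        }
      }
    }
    if (failure != null) {
      throw failure;
    }
    return results;
  }

  /** Small usage example: computes 1*1 .. n*n on a thread pool. */
  static List<Integer> squares(int n) {
    ExecutorService executor = Executors.newFixedThreadPool(4);
    try {
      List<Callable<Integer>> tasks = new ArrayList<>();
      for (int i = 1; i <= n; i++) {
        final int value = i;
        tasks.add(() -> value * value);
      }
      return invokeAll(executor, tasks);
    } finally {
      executor.shutdown();
    }
  }
}
```

Keeping the wait-and-rethrow logic in one helper means both callers get the same exception behavior.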
2023-06-28 13:52:01 +02:00
Adrien Grand 4029cc37a7
Fix MaxScoreBulkScorer#score's return value. (#12400)
`AssertingBulkScorer` asserts that the return value of `BulkScorer#score` may
not be in `[maxDoc, NO_MORE_DOCS)`. While this is not part of the contract of
`BulkScorer#score`, a reasonable implementation should never have return values
in this range, as it would suggest that more matches need collecting when we're
already out of the range of valid doc IDs. So this generally indicates a bug.

`MaxScoreBulkScorer` failed this assertion, because it can sometimes skip the
requested window of doc IDs, when the sum of maximum scores would be less than
the minimum competitive score. In that case, the best information it has is
that there are no matches in the window, but it cannot give a good estimate of
the next potential match.

This assertion in `AssertingBulkScorer` looks sane to me, so I made a small
change to `MaxScoreBulkScorer` to make sure it meets `AssertingBulkScorer`'s
expectations. This is done in a place that is only called once per scored
window, so it should not have a noticeable performance impact.
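
The range check and one way to satisfy it can be sketched as follows. This is an assumption about the shape of the fix, not the code from the commit: `ScoreReturnSketch` and both method names are hypothetical, while `NO_MORE_DOCS` mirrors Lucene's `DocIdSetIterator.NO_MORE_DOCS`, which is `Integer.MAX_VALUE`.

```java
/** Hypothetical sketch of a sane return value after skipping a window. */
final class ScoreReturnSketch {

  // Same value as Lucene's DocIdSetIterator.NO_MORE_DOCS.
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** True unless the value falls in [maxDoc, NO_MORE_DOCS), the range asserted against. */
  static boolean isSaneReturnValue(int returned, int maxDoc) {
    return returned < maxDoc || returned == NO_MORE_DOCS;
  }

  /**
   * After skipping a window ending at windowMax, the best known lower bound for the
   * next match is windowMax itself, unless that is already past the last valid doc ID.
   */
  static int nextCandidateAfterSkippedWindow(int windowMax, int maxDoc) {
    return windowMax >= maxDoc ? NO_MORE_DOCS : windowMax;
  }
}
```

For an index with `maxDoc = 100`, a skipped window ending at 50 yields 50, while a skipped window ending at 100 or beyond yields `NO_MORE_DOCS` instead of a value in the forbidden range.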
2023-06-28 09:17:41 +02:00
yixunx 6bb8cc0235
Let hard link wrapper fallback to delegate.copyFrom (#12384)
Co-authored-by: Yixun Xu <yixunx@palantir.com>
2023-06-28 09:17:20 +02:00
Stefan Vodita b88d3e1988
Catch offset overflows in byte pool (#9660) (#12392) 2023-06-28 09:16:56 +02:00
Tomas Eduardo Fernandez Lobbe b50b969d25
Update comment about IndexOptions ordinals (#12360)
FieldInfos no longer accepts changes in IndexOptions, however, different IndexOptions are still compared using their ordinals
2023-06-26 16:29:13 -07:00
Alan Woodward 79e8c9c8b9 Fix edge case in TestJoinUtil
TestJoinUtil.checkBoost() needs to check to see if there are
any results to validate, otherwise we can get an array-out-of-bounds
exception
2023-06-26 13:28:09 +01:00
Adrien Grand 4fec812cc5 Add back-compat indices for 9.7.0 2023-06-26 14:24:27 +02:00
Luca Cavanna 3e0dc2b572 Add missing change entries for 9.8 2023-06-26 11:23:36 +02:00
Adrien Grand edc2cf5cd1 DOAP changes for release 9.7.0 2023-06-26 11:05:46 +02:00
Alan Woodward edd799824f
Enable boosts on JoinUtil queries (#12388)
Boosts should not be ignored by queries returned from JoinUtil
2023-06-26 09:47:14 +01:00
Luca Cavanna 7f10dca1e5
Revert "Parallelize knn query rewrite across slices rather than segments (#12325)" (#12385)
This reverts commit 10bebde269.

Based on a recent discussion in
https://github.com/apache/lucene/pull/12183#discussion_r1235739084 we
agreed it makes more sense to parallelize knn query vector rewrite
across leaves rather than leaf slices.
2023-06-26 10:41:18 +02:00
Michael Sokolov cb195bd96e
github-12386: set java.io.tmpdir in replicator tests' forked processes (#12387) 2023-06-23 08:38:06 -04:00
Jonathan Ellis fe0278e36e
Reuse neighborqueue during hnsw index build (attempt 2) (#12372)
This changes HnswGraphBuilder to re-use the same candidates queues for adding nodes by allocating them in the Builder instance.

This saves about 2.5% of build time and takes memory allocations of NQ long[] from 25% of total to 0%. JFR runs are attached.

The difference from the first attempt (which actually made things slower for some graphs) is that it preserves the original code's behavior of using a 1-sized queue for the search in the levels above where the node actually gets added.

* Re-use NeighborQueue during build's search

* improve javadoc for OnHeapHnswGraphSearcher

* assert that results parameter is minheap as expected

* update CHANGES
2023-06-20 15:05:37 -04:00
Adrien Grand 8703e449ce
Change the MAXSCORE scorer to a bulk scorer. (#12361) 2023-06-20 18:55:03 +02:00
zhangchao 37b92adf6a
Avoid redundant loop for compute min value in DirectMonotonicWriter (#12377)
* Avoid redundant loop for get min value

* update CHANGES.txt
2023-06-20 09:12:15 -04:00
Alan Woodward 6d4314d46f Add back-compat indices for 9.6.0 2023-06-16 17:58:31 +02:00
Adrien Grand 6ee7b2b9f6 Add next minor version 9.8.0 2023-06-16 13:39:30 +02:00
Uwe Schindler 148236a50b
This allows VectorUtilProvider tests to be executed although hardware may not fully support vectorization or if C2 is not enabled (#12376) 2023-06-16 12:28:29 +02:00
Luca Cavanna bb6ec50d4c
Increased the likelihood of leveraging inter-segment concurrency in tests (#12369)
We have recently increased the likelihood of leveraging inter-segment search
concurrency in tests when `newSearcher` is used to create the index
searcher (see #959). When parallel execution is enabled though, it is
dependent on the number of documents and segments. That means
that out of 1000 test runs that use `RandomIndexWriter` to index a random
number of docs up to 100, we will effectively parallelize only a couple
of times.

This commit increases the likelihood of running concurrent searches by
randomly forcing a maximum of 1 segment per slice as well as a maximum of 1 doc per slice.
2023-06-15 11:35:10 +02:00