We consume a lot of memory for the `indexIn` slices. If `indexIn` is of
type `MemorySegmentIndexInput`, the overhead of keeping loads of slices
around just for cloning is far higher than the extra 12 bytes per reader this
adds (the slice description alone often costs a lot).
In a number of Elasticsearch use cases with high segment counts that I
investigated, this change would save up to O(GB) of heap.
MultiTermQuery returns a null ScorerSupplier if there are no terms in the index
that match the query terms.
With the introduction of PR #12156 we saw a performance degradation in boolean
queries where one of the mandatory clauses is a TermInSetQuery whose query terms
are not present in the field. Previously, in such cases TermInSetQuery returned
a null ScorerSupplier, which would short-circuit the whole boolean query.
This PR adds the ability for MultiTermQuery to return a null ScorerSupplier if a
field doesn't contain any of the query terms.
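For illustration, a minimal sketch of the kind of check this enables; the enclosing Weight and the `field` member are assumptions, not the actual patch:

```java
// Hedged sketch: a MultiTermQuery's Weight can return a null ScorerSupplier
// when the segment has no terms for the field, letting BooleanQuery
// short-circuit the whole conjunction for that segment.
@Override
public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOException {
  Terms terms = context.reader().terms(field); // `field` is an assumed member
  if (terms == null) {
    return null; // no terms for this field in this segment: nothing can match
  }
  return buildScorerSupplier(context, terms); // hypothetical helper
}
```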
Relates to PR #12156
Merges all immutable attributes in FieldInfos.FieldNumbers into one hashmap, saving memory when
writing big indices. Also fixes an exotic bug where calling clear() did not clear all attributes.
If a caller requires the Weight, it has to keep track of the Weight that the Scorer was created with in the first place, instead of relying on the Scorer to expose it.
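For illustration, a sketch of what callers do instead (the query/searcher setup is assumed):

```java
// Keep a reference to the Weight the Scorer was created from; the Scorer
// no longer exposes it.
Weight weight = searcher.createWeight(searcher.rewrite(query), ScoreMode.COMPLETE, 1f);
Scorer scorer = weight.scorer(leafContext);
// ... later, use the `weight` local instead of asking the scorer for it ...
```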
Closes #13410
This follows a similar approach as postings and only prefetches the first page
of data.
I verified that it works well for collectors such as `TopFieldCollector`, as
`IndexSearcher` first pulls a `LeafCollector`, then a `BulkScorer`, and only
then starts feeding the `BulkScorer` into the `LeafCollector`. So the
background I/O for the `LeafCollector`, which prefetches the first page of
doc values, and the background I/O for the `BulkScorer` run in parallel.
This applies to files where performing readahead could help:
- Doc values data (`.dvd`)
- Norms data (`.nvd`)
- Docs and freqs in postings lists (`.doc`)
- Points data (`.kdd`)
Other files (KNN vectors, stored fields, term vectors) keep using a `RANDOM`
advice.
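As a rough sketch of the pattern, assuming a typical 4 KiB page size (`IndexInput#prefetch` is the real API, the helper is hypothetical):

```java
// Hedged sketch: start readahead for just the first page of a data slice;
// later pages are read on demand.
static final int PAGE_SIZE = 4096; // assumption: typical OS page size

static void prefetchFirstPage(IndexInput in, long offset, long length) throws IOException {
  in.prefetch(offset, Math.min(length, PAGE_SIZE));
}
```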
We always allocate a long array of page size for a new `PackedLongValues#Iterator` instance, which is not necessary when packing a small number of values. This is most evident in scenarios with high-frequency flush operations.
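A one-line sketch of the idea (`pageSize` and `size()` are assumed internals, not the exact patch):

```java
// Hedged sketch: size the iterator's buffer by the number of buffered values
// rather than always allocating a full page.
final long[] currentValues = new long[(int) Math.min(pageSize, size())];
```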
This adds `StoredFields#prefetch(int)`, which mostly delegates to
`IndexInput#prefetch`. Callers can take advantage of this API to parallelize
I/O across multiple stored documents by first calling `StoredFields#prefetch`
on all doc IDs before calling `StoredFields#document` on all doc IDs.
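For example, a sketch of that pattern (the `reader` and `docIDs` setup is assumed):

```java
StoredFields storedFields = reader.storedFields();
for (int docID : docIDs) {
  storedFields.prefetch(docID); // schedule background I/O per document
}
for (int docID : docIDs) {
  Document doc = storedFields.document(docID); // I/O has likely completed by now
}
```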
I added a cache of recently prefetched blocks to the default codec, in order to
avoid prefetching the same block multiple times in a short period of time. This
felt sensible given that doc ID reordering via recursive graph bisection or
index sorting is likely to result in search results being clustered.
When int4 scalar quantization was merged, it added a new way to dynamically calculate quantiles.
However, that merge inadvertently changed the default behavior: a null confidenceInterval would calculate the dynamic quantiles instead of the previous auto-setting of 1 - 1/(dim + 1).
This commit formalizes the dynamic quantile calculation by setting the confidenceInterval to 0, and preserves the previous behavior for null confidenceIntervals so that users upgrading will not see different quantiles than they expect.
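A minimal sketch of the resulting semantics; the helper is hypothetical, not Lucene API:

```java
// Hedged sketch; returning null here means "use dynamic quantile calculation".
static Float resolveConfidenceInterval(Float confidenceInterval, int dim) {
  if (confidenceInterval == null) {
    return 1f - 1f / (dim + 1); // previous auto default, preserved on upgrade
  }
  if (confidenceInterval == 0f) {
    return null; // 0 explicitly requests dynamic quantile calculation
  }
  return confidenceInterval; // anything else is a static confidence interval
}
```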
This commit ensures that SimpleText[Float|Byte]VectorValues::scorer returns null when the vector values are empty, as per the scorer javadoc. Other KnnVectorsReader implementations have specialized empty implementations that do the same, e.g. OffHeapFloatVectorValues.EmptyOffHeapVectorValues. The VectorScorer interface is new in Lucene 9.11; see #13181.
An existing test hits this randomly, but a new test has been added that exercises this code path consistently. It is also useful for verifying other KnnVectorsReader implementations.
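Illustrative sketch of the check (the enclosing vector values class and its helper are assumed):

```java
// Hedged sketch: return a null scorer when there are no vectors, per the
// VectorScorer javadoc.
@Override
public VectorScorer scorer(float[] target) throws IOException {
  if (size() == 0) {
    return null; // empty vector values: nothing to score
  }
  return createScorer(target); // hypothetical helper
}
```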
Pull request #13406 inadvertently broke Lucene's handling of tragic exceptions by
stopping after the first `DocValuesProducer` whose `close()` call throws an
exception, instead of continuing to call `close()` on the remaining producers in
the list.
This moves back to the previous behavior.
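The robust pattern is the one `org.apache.lucene.util.IOUtils#close` implements: attempt to close every resource, rethrow the first exception, and suppress the rest. A sketch (the `docValuesProducers` list is assumed):

```java
// Closes every producer even if an earlier close() throws; the first
// exception is rethrown after all closes have been attempted.
IOUtils.close(docValuesProducers);
```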
Closes #13434
This commit updates the MemorySegment scorer so that it ensures the values are of the correct type.
The offset calculations for vectors in RandomAccessQuantizedByteVectorValues differ from those of non-quantized values. We can generalize the implementation for quantized vectors later, but for now, being passed quantized values indicates a bug in wrapping or delegation.
* Remove hppc dependency
* Change fork version to 0.10.0
* Add @lucene.internal
* Move hppc classes to oal.internal.hppc but export it.
* Delete hppc license since it's no longer a dependency.
---------
Co-authored-by: Dawid Weiss <dawid.weiss@carrotsearch.com>
This relates to #13359: we want to take advantage of the `Weight#scorerSupplier` call to start scheduling some I/O in the background in parallel across clauses. For this to work properly with top-level disjunctions, we need to move `#bulkScorer()` from `Weight` to `ScorerSupplier` as well, so that the disjunctive `BooleanQuery` first performs a call to `Weight#scorerSupplier()` on all inner clauses, and then `ScorerSupplier#bulkScorer` on all inner clauses.
`ScorerSupplier#get` and `ScorerSupplier#bulkScorer` only support being called once. This forced me to fix some inefficiencies in `bulkScorer()` implementations where we would pull scorers and then throw them away after realizing that the strategy we were planning on using was not optimal. This is why e.g. `ReqExclBulkScorer` now also supports prohibited clauses that produce a two-phase iterator.
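A sketch of the resulting two-phase pattern for a top-level disjunction (the surrounding variables are assumed):

```java
// Phase 1: ask every clause for a ScorerSupplier; this may start background I/O.
List<ScorerSupplier> suppliers = new ArrayList<>();
for (Weight w : clauseWeights) { // `clauseWeights` and `context` are assumed
  ScorerSupplier supplier = w.scorerSupplier(context);
  if (supplier != null) {
    suppliers.add(supplier);
  }
}
// Phase 2: only now turn each supplier into a BulkScorer, so the I/O started
// in phase 1 overlaps across clauses.
List<BulkScorer> bulkScorers = new ArrayList<>();
for (ScorerSupplier supplier : suppliers) {
  bulkScorers.add(supplier.bulkScorer());
}
```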
* avoid WrapperDownloader if we already have the JAR
* don't specify --source
This is more specific than needed, and some JDKs/configurations may complain about an incompatibility with --release.
From https://github.com/apache/solr/pull/2419