The introduction of the doc values skip index in #13449 broke the backward codec tests, as those codecs do not support
it. This commit fixes that by breaking up the base class for the tests.
Adds an optional skip list on top of doc values, exposed via the DocValuesSkipper abstraction. A new flag is
added to FieldType.java that configures whether to create a "skip index" for doc values.
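
To illustrate the idea behind a skip index, here is a minimal, self-contained sketch. It is not the DocValuesSkipper implementation: the class `ToySkipIndex`, its fixed block size, and the method `candidateBlockStarts` are hypothetical names chosen for this example. It assumes one numeric value per document and keeps only a min/max summary per block, so a range filter can skip whole blocks that cannot match.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy skip index: one min/max pair per fixed-size block of doc IDs. */
class ToySkipIndex {
    private final long[] minValues;
    private final long[] maxValues;
    private final int blockSize;

    ToySkipIndex(long[] values, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.blockSize = blockSize;
        int blocks = (values.length + blockSize - 1) / blockSize;
        minValues = new long[blocks];
        maxValues = new long[blocks];
        for (int b = 0; b < blocks; b++) {
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            int end = Math.min(values.length, (b + 1) * blockSize);
            for (int doc = b * blockSize; doc < end; doc++) {
                min = Math.min(min, values[doc]);
                max = Math.max(max, values[doc]);
            }
            minValues[b] = min;
            maxValues[b] = max;
        }
    }

    /** Returns the first doc ID of every block whose [min, max] overlaps [lo, hi]. */
    List<Integer> candidateBlockStarts(long lo, long hi) {
        List<Integer> starts = new ArrayList<>();
        for (int b = 0; b < minValues.length; b++) {
            if (maxValues[b] >= lo && minValues[b] <= hi) {
                starts.add(b * blockSize);
            }
        }
        return starts;
    }
}
```

With values `{1, 2, 3, 4, 10, 11, 12, 13, 5, 6, 7, 8}` and a block size of 4, a query for the range [10, 12] only needs to visit the block starting at doc 4; the other two blocks are skipped without reading their values.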
Co-authored-by: Adrien Grand <jpountz@gmail.com>
We consume a lot of memory for the `indexIn` slices. If `indexIn` is of
type `MemorySegmentIndexInput` the overhead of keeping loads of slices
around just for cloning is far higher than the extra 12 bytes per reader this
adds (the slice description alone often costs a lot).
In a number of Elasticsearch example uses with high segment counts I
investigated, this change would save up to O(GB) of heap.
MultiTermQuery returns null for ScorerSupplier if no terms in the index
match the query terms.
With the introduction of PR #12156 we saw a degradation in the performance of bool queries
where one of the mandatory clauses is a TermInSetQuery with query terms not present in
the field. Previously, in such cases, TermInSetQuery returned null for ScorerSupplier, which
would shortcut the whole bool query.
This PR adds the ability for MultiTermQuery to return null for ScorerSupplier if a field
doesn't contain any of the query terms.
Relates to PR #12156
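
The short-circuit described above can be sketched as follows. This is a simplified model, not Lucene code: `Clause`, `multiTerm` and `conjunction` are hypothetical stand-ins, a segment is modeled as a plain set of terms, and a "scorer" is just a string. The point is only that a null supplier from one mandatory clause lets the conjunction return null without building anything for the other clauses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

class NullSupplierSketch {
    /** Hypothetical stand-in for a mandatory query clause. */
    interface Clause {
        /** Returns null when the clause cannot match anything in this segment. */
        Supplier<String> scorerSupplier(Set<String> segmentTerms);
    }

    /** A multi-term clause: matches only if at least one query term exists in the segment. */
    static Clause multiTerm(Set<String> queryTerms) {
        return segmentTerms -> {
            for (String term : queryTerms) {
                if (segmentTerms.contains(term)) {
                    return () -> "scorer(" + queryTerms.size() + " terms)";
                }
            }
            return null; // no query term is present in this segment
        };
    }

    /** A conjunction gives up as soon as any mandatory clause returns null. */
    static Supplier<String> conjunction(List<Clause> required, Set<String> segmentTerms) {
        List<Supplier<String>> suppliers = new ArrayList<>();
        for (Clause clause : required) {
            Supplier<String> supplier = clause.scorerSupplier(segmentTerms);
            if (supplier == null) {
                return null; // shortcut: the whole conjunction cannot match
            }
            suppliers.add(supplier);
        }
        return () -> {
            StringBuilder sb = new StringBuilder("AND");
            for (Supplier<String> s : suppliers) {
                sb.append(' ').append(s.get());
            }
            return sb.toString();
        };
    }
}
```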
Merges all immutable attributes in FieldInfos.FieldNumbers into one hashmap saving memory when
writing big indices. Fixes an exotic bug when calling clear where not all attributes were cleared.
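
As a rough illustration of why a single map helps, consider the sketch below. It does not mirror the actual FieldInfos.FieldNumbers fields: `FieldAttributes` and the attributes it bundles are assumptions made for this example. With one map from field name to an immutable bundle, there is one entry per field instead of one entry per field per attribute, and `clear()` cannot leave some attributes behind.

```java
import java.util.HashMap;
import java.util.Map;

class FieldNumbersSketch {
    /** Hypothetical immutable bundle of per-field attributes. */
    static final class FieldAttributes {
        final int number;
        final String docValuesType;
        final int vectorDimension;

        FieldAttributes(int number, String docValuesType, int vectorDimension) {
            this.number = number;
            this.docValuesType = docValuesType;
            this.vectorDimension = vectorDimension;
        }
    }

    // One map instead of one map per attribute.
    private final Map<String, FieldAttributes> byName = new HashMap<>();

    /** Registers a field; rejects a conflicting field number for an existing name. */
    void add(String name, FieldAttributes attributes) {
        FieldAttributes previous = byName.putIfAbsent(name, attributes);
        if (previous != null && previous.number != attributes.number) {
            throw new IllegalArgumentException("field number mismatch for " + name);
        }
    }

    FieldAttributes get(String name) {
        return byName.get(name);
    }

    int size() {
        return byName.size();
    }

    /** Clearing the single map clears every attribute at once. */
    void clear() {
        byName.clear();
    }
}
```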
If a caller requires the Weight, it has to keep track of the Weight with which the Scorer was created in the first place, instead of relying on the Scorer.
Closes #13410
This follows a similar approach as postings and only prefetches the first page
of data.
I verified that it works well for collectors such as `TopFieldCollector`, as
`IndexSearcher` first pulls a `LeafCollector`, then a `BulkScorer` and only
then starts feeding the `BulkScorer` into the `LeafCollector`. So the
background I/O for the `LeafCollector`, which prefetches the first page of
doc values, and the background I/O for the `BulkScorer` run in parallel.
This applies to files where performing readahead could help:
- Doc values data (`.dvd`)
- Norms data (`.nvd`)
- Docs and freqs in postings lists (`.doc`)
- Points data (`.kdd`)
Other files (KNN vectors, stored fields, term vectors) keep using a `RANDOM`
advice.
We always allocate a long array of page size for a new PackedLongValues#Iterator instance, which is not necessary when packing a small number of values. This is more evident in the scenario of high-frequency flush operations.
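
One way to avoid the over-allocation is to cap the buffer at the number of values actually packed. The sketch below shows only that sizing rule; the helper `bufferLength` is hypothetical, and whether the actual change sizes the buffer exactly this way is an assumption.

```java
class IteratorBufferSketch {
    /**
     * Hypothetical sizing rule: never allocate more slots than there are values,
     * and never more than one page.
     */
    static int bufferLength(long totalValues, int pageSize) {
        if (totalValues < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("totalValues must be >= 0 and pageSize > 0");
        }
        return (int) Math.min(totalValues, (long) pageSize);
    }
}
```

For 10 packed values and a page size of 1024 this allocates 10 slots rather than 1024, which adds up when many small instances are iterated during frequent flushes.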
This adds `StoredFields#prefetch(int)`, which mostly delegates to
`IndexInput#prefetch`. Callers can take advantage of this API to parallelize
I/O across multiple stored documents by first calling `StoredFields#prefetch`
on all doc IDs before calling `StoredFields#document` on all doc IDs.
I added a cache of recently prefetched blocks to the default codec, in order to
avoid prefetching the same block multiple times in a short period of time. This
felt sensible given that doc ID reordering via recursive graph bisection or
index sorting are likely to result in search results being clustered.
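
The calling pattern and the cache idea can be sketched like this. The `Stored` interface is a simplified, hypothetical stand-in for the two `StoredFields` methods named above (no checked exceptions, documents modeled as strings), and `RecentBlocks` is an illustrative ring buffer, not the codec's actual cache.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PrefetchSketch {
    /** Hypothetical, simplified stand-in for StoredFields#prefetch and StoredFields#document. */
    interface Stored {
        void prefetch(int docID);

        String document(int docID);
    }

    /** Prefetch every doc ID first, then read them, so the I/O can overlap. */
    static List<String> loadAll(Stored storedFields, int[] docIDs) {
        for (int docID : docIDs) {
            storedFields.prefetch(docID);
        }
        List<String> docs = new ArrayList<>();
        for (int docID : docIDs) {
            docs.add(storedFields.document(docID));
        }
        return docs;
    }

    /** Illustrative ring buffer of recently prefetched block IDs (assumed non-negative). */
    static final class RecentBlocks {
        private final long[] recent;
        private int next;

        RecentBlocks(int capacity) {
            recent = new long[capacity];
            Arrays.fill(recent, -1L); // -1 marks an empty slot
        }

        /** Returns true and records the block if it was not prefetched recently. */
        boolean shouldPrefetch(long blockID) {
            for (long seen : recent) {
                if (seen == blockID) {
                    return false;
                }
            }
            recent[next] = blockID;
            next = (next + 1) % recent.length;
            return true;
        }
    }
}
```

When search results cluster in the same blocks, most `shouldPrefetch` calls after the first one for a block return false, so the duplicate prefetches are skipped.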
When int4 scalar quantization was merged, it added a new way to dynamically calculate quantiles.
However, when that was merged, I inadvertently changed the default behavior, where a null confidenceInterval would actually calculate the dynamic quantiles instead of doing the previous auto-setting to 1 - 1/(dim + 1).
This commit formalizes the dynamic quantile calculation through setting the confidenceInterval to 0, and preserves the previous behavior for null confidenceIntervals so that users upgrading will not see different quantiles than they would expect.
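
The resulting rules can be summarized in a small sketch. The class and method names here are hypothetical and only restate the behavior described above: null falls back to 1 - 1/(dim + 1), 0 requests dynamic quantiles, and any other value is used as given.

```java
class ConfidenceIntervalSketch {
    /** Dynamic quantiles are requested explicitly with a confidence interval of 0. */
    static boolean usesDynamicQuantiles(Float confidenceInterval) {
        return confidenceInterval != null && confidenceInterval == 0f;
    }

    /** A null confidence interval keeps the previous default of 1 - 1/(dim + 1). */
    static float resolve(Float confidenceInterval, int dim) {
        if (confidenceInterval == null) {
            return 1f - 1f / (dim + 1);
        }
        return confidenceInterval;
    }
}
```

For example, a 99-dimensional field with a null confidenceInterval resolves to 1 - 1/100 = 0.99.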