We don't need to clone the index input we hold on to in OffHeapFSTStore
since we only use it for slicing from known coordinates anyway.
-> remove the cloning and add the infrastructure to initialize
OffHeapFSTStore without seeking the input to the starting offset.
The minimize operation can easily consume a massive amount of memory
with its lists of boxed Integer (even though these are mostly cached,
it still takes more than 4 bytes per instance to store them, compared with plain int storage) and its never-ending
duplicate and empty StateList instances.
We can fix the boxed integer situation, and probably speed things up, by using the hppc primitive collections.
To fix the duplicate/empty StateList instances, we can use a constant. This requires some hacky forking
on the write path, but that's about it.
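A minimal sketch of the idea, not the actual change: `IntStateList` is a hypothetical stand-in for StateList, a plain `int[]` stands in for the hppc collection, and the `add` helper shows the kind of forking the write path needs once all empty lists share one constant.

```java
import java.util.Arrays;

/** Hypothetical growable list of state ids backed by a plain int[]. */
final class IntStateList {
  /** One shared instance for every empty list; it must never be mutated. */
  static final IntStateList EMPTY = new IntStateList(new int[0]);

  private int[] values;
  private int size;

  private IntStateList(int[] values) {
    this.values = values;
  }

  /**
   * Write path: fork a private list when the caller still holds the shared
   * EMPTY constant, then append. Callers must keep the returned reference.
   */
  static IntStateList add(IntStateList list, int state) {
    IntStateList target = (list == EMPTY) ? new IntStateList(new int[4]) : list;
    if (target.size == target.values.length) {
      // Grow the primitive backing array; no Integer boxing is involved.
      target.values = Arrays.copyOf(target.values, target.size * 2);
    }
    target.values[target.size++] = state;
    return target;
  }

  int size() {
    return size;
  }

  int get(int index) {
    if (index >= size) {
      throw new IndexOutOfBoundsException("index " + index);
    }
    return values[index];
  }
}
```

The cost of the shared constant is that every writer has to go through `add` (or an equivalent check) instead of mutating the list it was handed.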
This is partly motivated by ES users at times creating broken, very long prefix queries that can then eat up
GBs of heap. With this change, the examples I've been looking at become about 6x cheaper heap-wise, making it
less likely that this kind of mistake impacts stability.
This includes the following changes:
- New `IndexInput#slice(String, long, long, ReadAdvice)` API that allows creating slices with a different read advice.
- `PosixNativeAccess` now explicitly sets `MADV_NORMAL` when called with `ReadAdvice.NORMAL`. This is required to be able to override a `RANDOM` advice of a compound file with a `NORMAL` advice of a sub file of this compound file.
- `PosixNativeAccess` now ignores only the first page, instead of the whole range, if a range of bytes starts before the `MemorySegment`.
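The last point can be illustrated with plain address arithmetic. This is a sketch under assumptions, not the `PosixNativeAccess` code: `PageRange` and `alignForAdvise` are hypothetical names, addresses are modelled as `long` values, and the assumption is that the advised range must start on a page boundary inside the segment.

```java
/** Hypothetical value holder for a page-aligned address range. */
final class PageRange {
  final long start;
  final long length;

  PageRange(long start, long length) {
    this.start = start;
    this.length = length;
  }

  /**
   * Aligns [segmentAddress + offset, segmentAddress + offset + length) down to
   * a page boundary. If that boundary lies before the segment, only the first
   * page is skipped instead of giving up on the whole range.
   * Returns null when nothing is left to advise.
   */
  static PageRange alignForAdvise(long segmentAddress, long offset, long length, long pageSize) {
    long start = segmentAddress + offset;
    long end = start + length;
    long alignedStart = start - (start % pageSize); // round down to a page boundary
    if (alignedStart < segmentAddress) {
      alignedStart += pageSize; // skip just the first, partially covered page
    }
    if (alignedStart >= end) {
      return null; // the range did not extend past the skipped page
    }
    return new PageRange(alignedStart, end - alignedStart);
  }
}
```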
The introduction of the doc values skip index in #13449 broke the backward codec tests, as those codecs do not support
it. This commit fixes that by breaking up the base class for the tests.
Adds an optional skip list on top of doc values, exposed via the DocValuesSkipper abstraction. A new flag
in FieldType.java configures whether to create a "skip index" for doc values.
Co-authored-by: Adrien Grand <jpountz@gmail.com>
We consume a lot of memory for the `indexIn` slices. If `indexIn` is of
type `MemorySegmentIndexInput`, the overhead of keeping loads of slices
around just for cloning is far higher than the extra 12 bytes per reader this
adds (the slice description alone often costs a lot).
In a number of Elasticsearch example uses with high segment counts I
investigated, this change would save up to O(GB) of heap.
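A rough sketch of the trade-off, using `java.nio.ByteBuffer` as a stand-in for the index input. `LazySliceReader` and its fields are hypothetical; the only point is that a `long` plus an `int` (12 bytes) replaces a long-lived slice object, and the slice is created when it is actually needed.

```java
import java.nio.ByteBuffer;

/** Hypothetical reader that remembers slice coordinates instead of a slice. */
final class LazySliceReader {
  private final ByteBuffer shared; // one instance shared by all readers of the file
  private final long start;        // 8 bytes per reader
  private final int length;        // 4 bytes per reader

  LazySliceReader(ByteBuffer shared, long start, int length) {
    this.shared = shared;
    this.start = start;
    this.length = length;
  }

  /** Creates a short-lived view on demand; the shared buffer's position is untouched. */
  ByteBuffer openSlice() {
    ByteBuffer view = shared.duplicate();
    view.position((int) start);
    view.limit((int) start + length);
    return view.slice();
  }
}
```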
MultiTermQuery returns null for ScorerSupplier if no terms in the index
match the query terms.
With the introduction of PR #12156 we saw a degradation in the performance of bool queries
where one of the mandatory clauses is a TermInSetQuery with query terms not present in
the field. Previously, in such cases, TermInSetQuery returned null for ScorerSupplier, which
would short-circuit the whole bool query.
This PR adds the ability for MultiTermQuery to return null for ScorerSupplier if a field
doesn't contain any query terms.
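Why the null matters can be sketched without Lucene. `SketchScorerSupplier`, `multiTermSupplier` and `conjunction` are hypothetical, simplified stand-ins rather than the real classes, and the cost value is just a placeholder.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified stand-in for a scorer supplier; not the Lucene class. */
interface SketchScorerSupplier {
  long cost();
}

final class NullSupplierSketch {
  /** Returns null when no query term occurs in the field, signalling "no matches". */
  static SketchScorerSupplier multiTermSupplier(Set<String> fieldTerms, List<String> queryTerms) {
    long matching = 0;
    for (String term : queryTerms) {
      if (fieldTerms.contains(term)) {
        matching++;
      }
    }
    if (matching == 0) {
      return null;
    }
    final long cost = matching; // placeholder cost: number of terms found
    return () -> cost;
  }

  /** A conjunction can give up as soon as one mandatory clause has no supplier. */
  static SketchScorerSupplier conjunction(List<SketchScorerSupplier> mandatory) {
    long minCost = Long.MAX_VALUE;
    for (SketchScorerSupplier clause : mandatory) {
      if (clause == null) {
        return null; // short-circuit: the other clauses are never evaluated
      }
      minCost = Math.min(minCost, clause.cost());
    }
    final long cost = minCost;
    return () -> cost;
  }
}
```

If the multi-term clause returned a non-null but empty supplier instead, the conjunction would still have to set up and advance the other clauses before discovering that nothing matches.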
Relates to PR #12156
Merges all immutable attributes in FieldInfos.FieldNumbers into one hashmap, saving memory when
writing big indices. Also fixes an exotic bug where calling clear did not clear all attributes.
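Roughly, the shape of the change looks like the following sketch. `FieldNumbersSketch`, `FieldProperties` and the chosen attributes are illustrative assumptions, not the actual FieldInfos.FieldNumbers fields.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registry keeping all immutable per-field attributes in one map. */
final class FieldNumbersSketch {
  /** Hypothetical bundle of attributes that never change once a field exists. */
  static final class FieldProperties {
    final int number;
    final String docValuesType;
    final int vectorDimension;

    FieldProperties(int number, String docValuesType, int vectorDimension) {
      this.number = number;
      this.docValuesType = docValuesType;
      this.vectorDimension = vectorDimension;
    }
  }

  // One entry per field instead of one entry per field in each of several maps.
  private final Map<String, FieldProperties> propertiesByName = new HashMap<>();

  void add(String name, int number, String docValuesType, int vectorDimension) {
    FieldProperties previous =
        propertiesByName.putIfAbsent(
            name, new FieldProperties(number, docValuesType, vectorDimension));
    if (previous != null && previous.number != number) {
      throw new IllegalArgumentException("field " + name + " already has number " + previous.number);
    }
  }

  FieldProperties get(String name) {
    return propertiesByName.get(name);
  }

  /** A single map means clear() cannot forget one of the attributes. */
  void clear() {
    propertiesByName.clear();
  }
}
```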
If a caller requires the Weight, it has to keep track of the Weight with which the Scorer was created in the first place instead of relying on the Scorer.
Closes #13410