When FMA is not supported by the hardware, these methods fall back to
BigDecimal usage which causes them to be 2500x slower.
While most hardware in the last 10 years may have the support, out of
box both VirtualBox and QEMU don't pass thru FMA support (for the latter
at least you can tweak it with e.g. -cpu host or similar to fix this).
This creates a terrible undocumented performance trap. Prevent it from
sneaking into our codebase.
Drop 3.x compatibility (which was pickier at compile-time and prevented slow things from happening). Instead add paranoia to runtime tests, so that they fail if antlr would do something slow in the parsing. This is needed because antlrv4 is a big performance trap: https://github.com/antlr/antlr4/blob/master/doc/faq/general.md
"Q: What are the main design decisions in ANTLR4?
Ease-of-use over performance. I will worry about performance later."
It allows us to move forward with newer antlr but hopefully prevent the associated headaches.
Signed-off-by: Andriy Redko <andriy.redko@aiven.io>
Co-authored-by: Robert Muir <rmuir@apache.org>
Add new stored fields and termvectors interfaces: IndexReader.storedFields()
and IndexReader.termVectors(). Deprecate IndexReader.document() and IndexReader.getTermVector().
The new APIs do not rely upon ThreadLocal storage for each index segment, which can greatly
reduce RAM requirements when there are many threads and/or segments.
Co-authored-by: Adrien Grand <jpountz@gmail.com>
* Leverage DISI static factory methods more over custom DISI impl where possible.
* Assert points field is a single-dim.
* Bound cost estimate by the cost of the doc values field (for sparse fields).
`RandomAccessVectorValues` is internally used in our HNSW implementation to
provide random access to vectors, both at index and search time. In order to
better reflect this, this change does the following:
- `RandomAccessVectorValues` moves to `org.apache.lucene.util.hnsw`.
- `BufferingKnnVectorsWriter` no longer has a dependency on
`RandomAccessVectorValues` and moves to `org.apache.lucene.codecs` since
it's more of a utility class for KNN vector file formats than an index API.
Maybe we should think of moving it near each file format that uses it
instead.
- `SortingCodecReader` no longer has a dependency on
`RandomAccessVectorValues`.
Closes#10623
This generalizes #687 to indexes that are sorted in descending order. The main
challenge with descending sorts is that they require being able to compute the
last doc ID that matches a value, which would ideally require walking the BKD
tree in reverse order, but the API only support moving forward. This is worked
around by maintaining a stack of `PointTree` clones to perform the search.
These are easy/obvious ones to disable since we don't use the
functionality at all: the checks are literally useless.
This gives some performance boost to the error-prone, although it is
still pretty slow.
triage most of the previously disabled checks into TODO, noisy, etc
PassageScorer uses a priority queue of size maxPassages to keep track of
which highlighted passages are worth returning to the user. Once all
passages have been collected, we go through and merge overlapping
passages together, but this reduction in the number of passages is not
compensated for by re-adding the highest-scoring passages that were pushed
out of the queue by passages which have been merged away.
This commit increases the size of the priority queue to try and account for
overlapping passages that will subsequently be merged together.
ExitableTerms should not iterate through the terms to retrieve min and max when the wrapped implementation has the values cached (e.g. FieldsReader, OrdsFieldReader)
OffsetsFromMatchIterator and OffsetsFromPositions both have package-
private constructors, which makes them difficult to use as components in a
separate highlighter implementation.
This does not change the semantics or performance of our setup.
Instead, it explicitly enables checks that we want vs disabling checks
that we don't want.
Also reordered checks to match the error-prone website list of checks
for easier maintenance.
It is now clear that many useless checks are enabled, we can disable
some of them and try to get the performance reasonable.
`VectorValues` have a `cost()` method that reports an approximate number of
documents that have a vector, but also a `size()` method that reports the
accurate number of vectors in the field. Since KNN vectors only support
single-valued fields we should enforce that `cost()` returns the `size()`.