This test sometimes fails because `SimpleText` has a non-deterministic size for
its segment info file, due to escape characters. The test now enforces the
default codec, and checks that segments have the expected size before moving
forward with forceMerge().
Closes #12648
The code was written as if frequencies should be lazily decoded, except that
when refilling buffers, freqs were getting eagerly decoded instead of lazily.
### Description
This PR addresses issue #12394. It adds a **`similarityToQueryVector`** API to `DoubleValuesSource` for computing vector similarity scores between a query vector and a document's `KnnByteVectorField`/`KnnFloatVectorField`, using two new DVS implementations (`ByteVectorSimilarityValuesSource` for byte vectors and `FloatVectorSimilarityValuesSource` for float vectors). Below are the method signatures added to DVS in this PR:
- `DoubleValues similarityToQueryVector(LeafReaderContext ctx, float[] queryVector, String vectorField)` *(uses FloatVectorSimilarityValuesSource)*
- `DoubleValues similarityToQueryVector(LeafReaderContext ctx, byte[] queryVector, String vectorField)` *(uses ByteVectorSimilarityValuesSource)*
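A minimal usage sketch, assuming these are exposed as static helpers on `DoubleValuesSource` and that the index contains a hypothetical `KnnFloatVectorField` named `"vector"` (reader setup and exception handling omitted):
```java
// Sketch only: per-document similarity between a float query vector and the
// hypothetical "vector" field; `reader` is an open DirectoryReader.
float[] queryVector = {0.12f, 0.34f, 0.56f};
for (LeafReaderContext ctx : reader.leaves()) {
  DoubleValues similarity =
      DoubleValuesSource.similarityToQueryVector(ctx, queryVector, "vector");
  for (int doc = 0; doc < ctx.reader().maxDoc(); doc++) {
    if (similarity.advanceExact(doc)) {
      double score = similarity.doubleValue(); // similarity score for this document
    }
  }
}
```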
Closes #12394
DocumentsWriter had some duplicate logic for iterating over
segments to be flushed. This change simplifies some of the loops
and moves common code into one place. This also adds tests to ensure
we actually freeze and apply deletes on segment flush.
Relates to #12572
While going through https://github.com/apache/lucene/pull/12582,
I noticed that our offheap vector readers haven't changed in a while; we just keep copying them around for no reason.
To make adding a new vector codec simpler, this refactors the Lucene95 codec so that its offheap vector storage format (readers/writers) can be reused.
Additionally, the storage format handles reading and writing the appropriate fields for sparse vectors against the provided index inputs/outputs.
This should reduce the churn in new codecs significantly.
Currently FSTCompiler and FST have a circular dependency on each
other: FSTCompiler creates an instance of FST, and on adding a node
(add(IntsRef input, T output)) it delegates to FST.addNode() and
passes itself as a parameter. This introduces a circular dependency
and mixes up the FST construction and traversal code.
To make matters worse, this implies one can call FST.addNode with an
arbitrary FSTCompiler (since it is a parameter), when in reality it
should only be the compiler that created the FST.
This commit moves the addNode method to FSTCompiler instead.
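Schematically, the call site changes roughly like this (simplified, illustrative signatures only, not copied from the source):
```java
// Illustration only; names and signatures are simplified.
// Before: construction logic lived on FST, and the compiler passed itself in.
long nodeBefore = fst.addNode(fstCompiler, uncompiledNode);
// After: FSTCompiler, which created the FST it is building, adds the node itself.
long nodeAfter = fstCompiler.addNode(uncompiledNode);
```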
Co-authored-by: Anh Dung Bui <buidun@amazon.com>
e.g. by the user responding with ^D:
```
Press (n)ext page, (q)uit or enter number to jump to a page.
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "String.length()" because "line" is null
at org.apache.lucene.demo.SearchFiles.doPagingSearch(SearchFiles.java:244)
at org.apache.lucene.demo.SearchFiles.main(SearchFiles.java:152)
```
```
Press (p)revious page, (n)ext page, (q)uit or enter number to jump to a page.
n
Only results 1 - 50 of 104 total matching documents collected.
Collect more (y/n) ?
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "String.length()" because "line" is null
at org.apache.lucene.demo.SearchFiles.doPagingSearch(SearchFiles.java:198)
at org.apache.lucene.demo.SearchFiles.main(SearchFiles.java:152)
```
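One way to avoid both crashes is to treat a null `readLine()` result (end of input, e.g. ^D) as a request to quit. A minimal sketch of such a guard, not necessarily the exact fix applied:
```java
// Sketch only: inside the interactive prompt loop of SearchFiles.doPagingSearch,
// where `in` is the BufferedReader wrapping System.in.
String line = in.readLine();
if (line == null || line.length() == 0) {
  break; // end of input (e.g. ^D) or empty answer: stop paging instead of throwing an NPE
}
```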
Co-authored-by: Piotrek Żygieło <pzygielo@users.noreply.github.com>
MaxScoreBulkScorer computes windows based on the set of clauses that were
essential in the *previous* window. This usually works well as the set of
essential clauses tends to be stable over time, but there are cases when
clauses get swapped between essential and non-essential, and computing
windows based on the previous window can lead to suboptimal choices.
This PR creates a first proposal for the next score window using essential
clauses from the previous window, and then creates a second proposal once
scorers have been partitioned and their max scores have been updated. If this
second proposal results in a smaller window, it gets used.
On one particular query (`the incredibles`) and a reordered index with BP
(which increases chances that scorers move from essential to non-essential or
vice-versa), this change yielded a 2.3x speedup.
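Schematically (with a hypothetical helper, not the actual MaxScoreBulkScorer code), the window bound becomes the smaller of the two proposals:
```java
// Hypothetical sketch: proposeWindowMax is an illustrative helper, not real Lucene code.
int firstProposal = proposeWindowMax(previousEssentialClauses); // from the previous window
int secondProposal = proposeWindowMax(repartitionedClauses); // after max scores are updated
int windowMax = Math.min(firstProposal, secondProposal); // keep the smaller window
```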
The TaskExecutor used to run concurrent operations may leave running tasks behind when one of the tasks throws an exception. This commit ensures that it instead waits for all tasks to complete before re-throwing the exception. If more than one exception is thrown, the additional ones are added as suppressed exceptions to the first one that was caught.
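A simplified sketch of the wait-then-rethrow behavior described above (illustrative names, not the actual TaskExecutor code):
```java
// Wait for every task to finish, then rethrow the first failure with the
// remaining failures attached as suppressed exceptions.
static <T> List<T> collectAll(List<Future<T>> futures) {
  List<T> results = new ArrayList<>();
  RuntimeException firstFailure = null;
  for (Future<T> future : futures) {
    try {
      results.add(future.get()); // always wait, even after an earlier task has failed
    } catch (Exception e) {
      if (firstFailure == null) {
        firstFailure = new RuntimeException(e);
      } else {
        firstFailure.addSuppressed(e);
      }
    }
  }
  if (firstFailure != null) {
    throw firstFailure;
  }
  return results;
}
```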
This is a minor refactor of the HNSW graph merging logic.
Instead of directly checking the KnnVectorsReader version, this commit adjusts the logic to check whether a specific interface is satisfied before returning a view of the HnswGraph.
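Conceptually, the merge path now checks for the capability rather than the concrete reader (the interface and accessor names below are assumptions for illustration):
```java
// Illustration only: interface and method names are assumed, not quoted from the source.
if (vectorsReader instanceof HnswGraphProvider) {
  // The reader can expose its graph, so reuse it as the starting point for merging.
  HnswGraph graph = ((HnswGraphProvider) vectorsReader).getGraph(fieldName);
} else {
  // Otherwise fall back to building the merged graph from scratch.
}
```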
Given a query that implements Accountable, the LRUQueryCache would increment
its internal accounting by the amount reported by Accountable.ramBytesUsed(), but
only decrement on eviction by the default used for all other queries. This meant that
the cache could eventually think it had run out of space, even if there were no queries
in it at all. This commit ensures that queries that implement Accountable are always
accounted for correctly.
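The fix amounts to using the same per-query estimate on both sides of the ledger; schematically (hypothetical helper, not the actual LRUQueryCache code):
```java
// Hypothetical helper illustrating symmetric accounting; the default constant is
// modeled on LRUQueryCache's per-query estimate and may not match the real field name.
private long ramBytesUsed(Query query) {
  return query instanceof Accountable
      ? ((Accountable) query).ramBytesUsed()
      : QUERY_DEFAULT_RAM_BYTES_USED;
}
// Increment the cache's accounting by ramBytesUsed(query) when caching a query, and
// decrement by the same value on eviction, so the two sides always match.
```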
As we introduce more places where we add concurrency (there are
currently three), a common pattern emerges: check whether an executor
was provided, then either run sequentially on the caller thread or in
parallel relying on the executor.
That can be improved by internally creating a TaskExecutor backed by
an executor that runs tasks on the caller thread. This ensures that
the task executor is never null, so the common conditional is no
longer needed: the concurrent path that uses the task executor becomes
the default and only choice for operations that can be parallelized.
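A minimal sketch of that idea (illustrative wiring; the actual constructor and field names may differ):
```java
// Sketch only: fall back to a caller-thread executor when none is provided, so a
// TaskExecutor is always available and the "is there an executor?" branch disappears.
Executor effectiveExecutor = (executor == null) ? Runnable::run : executor;
TaskExecutor taskExecutor = new TaskExecutor(effectiveExecutor);
```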