A recent test failure signaled that when the simple text codec was randomly selected, byte vectors could not be written.
This commit addressed that by adding support for writing byte vectors to SimpleTextKnnVectorsWriter.
Note that while support is added to the BufferingKnnVectorsWriter base class, the 90, 91 and 92 writers don't need to support
byte vectors and will throw an UnsupportedOperationException when asked to write them.
Follow-up of #12105 to remove the deprecated classes for the next major version.
Removes KnnVectorField, KnnVectorQuery, VectorValues and LeafReader#getVectorValues.
When a field indexes numeric doc values, `MemoryIndex` does an unchecked cast
to `java.lang.Long`. However, the new `IntField` represents the value as a
`java.lang.Integer` so this cast fails. This commit aligns `MemoryIndex` with
`IndexingChain` by casting to `Number` and calling `Number#longValue` instead
of casting to `Long`.
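
A minimal, self-contained sketch of the difference between the two casts follows. The class and method names (`NumericDocValueCast`, `toLong`) are hypothetical and only illustrate the idea; this is not the actual `MemoryIndex` code.

```java
final class NumericDocValueCast {
    // Hypothetical helper: widens any boxed numeric value to a long by
    // going through Number, instead of assuming the value is a Long.
    static long toLong(Object fieldValue) {
        return ((Number) fieldValue).longValue();
    }

    public static void main(String[] args) {
        Object intValue = Integer.valueOf(42); // what an int-based field would carry
        Object longValue = Long.valueOf(42L);  // what a long-based field would carry

        // Both work when casting to Number.
        System.out.println(toLong(intValue));
        System.out.println(toLong(longValue));

        // The unchecked cast to Long fails for an Integer at runtime.
        try {
            long broken = (Long) intValue;
            System.out.println(broken);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException for Integer cast to Long");
        }
    }
}
```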
We recently introduced KnnByteVectorField, KnnByteVectorQuery and ByteVectorValues. The corresponding float variants of the same classes don't follow the same naming convention: KnnVectorField, KnnVectorQuery and VectorValues. Ideally their names would reflect that they are the float variant of the vector field, vector query and vector values.
This commit aims at clarifying this in the public facing API, by deprecating the current float classes in favour of new ones that are their exact copy but follow the same naming conventions as the byte ones.
As a result, LeafReader#getVectorValues is also deprecated in favour of the newly introduced getFloatVectorValues method, which returns FloatVectorValues.
Relates to #11963
The main classes involved are ByteVectorValues, KnnByteVectorField and KnnByteVectorQuery. It becomes quite natural to simplify things further and use byte[] in the following methods too: ByteVectorValues#vectorValue, KnnVectorReader#search, LeafReader#searchNearestVectors, HNSWGraphSearcher#search, VectorSimilarityFunction#compare, VectorUtil#cosine, VectorUtil#squareDistance, VectorUtil#dotProduct, VectorUtil#dotProductScore
This method tries to expose an encoded view of vectors, but this shouldn't be
part of our user-facing API. With this change, the way vectors are encoded
is left entirely to the codec.
Clean up this query a bit and support:
* NumericDocValuesField.newSlowSetQuery()
* SortedNumericDocValuesField.newSlowSetQuery()
This complements the existing docvalues-based range queries, with a set query.
Add ScorerSupplier/cost estimation support to PointInSetQuery
Add newSetQuery() to IntField/LongField/DoubleField/FloatField, that uses IndexOrDocValuesQuery
WeightedSpanTermExtractor will try to rewrite queries that it doesn't
know about, to see if they end up as something it does know about and
that it can extract terms from. To support field merging, it rewrites against
a delegating leaf reader that does not support getFieldInfos().
FieldExistsQuery uses getFieldInfos() in its rewrite, which means that
if one is passed to WeightedSpanTermExtractor, we get an
UnsupportedOperationException thrown.
This commit makes WeightedSpanTermExtractor aware of FieldExistsQuery,
so that it can just ignore it and avoid throwing an exception.
When sub collectors don't agree on their `ScoreMode`, `MultiCollector`
currently returns `COMPLETE`. This is reasonable under the assumption that there
is likely one collector computing top hits (`TOP_SCORES`) and another one
computing facets (`COMPLETE_NO_SCORES`). However it
is also possible to have one collector computing top hits by field (`TOP_DOCS`)
and another one doing facets (`COMPLETE_NO_SCORES`), and `MultiCollector`
shouldn't report that scores are needed in that case.
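
The intended behaviour can be sketched as below. The `Mode` enum is a local stand-in for Lucene's `ScoreMode`, and `merge` is a hypothetical helper, so this shows the rule described above under those assumptions rather than the actual `MultiCollector` code.

```java
import java.util.List;

final class ScoreModeMerge {
    // Local stand-in for the score modes named above (an assumption, not Lucene's enum).
    enum Mode {
        COMPLETE(true),
        COMPLETE_NO_SCORES(false),
        TOP_SCORES(true),
        TOP_DOCS(false);

        final boolean needsScores;

        Mode(boolean needsScores) {
            this.needsScores = needsScores;
        }
    }

    // If all sub collectors agree, keep their mode. Otherwise visit all hits,
    // and only ask for scores if at least one sub collector needs them.
    static Mode merge(List<Mode> modes) {
        if (modes.isEmpty()) {
            throw new IllegalArgumentException("at least one mode is required");
        }
        Mode first = modes.get(0);
        boolean allSame = true;
        boolean anyNeedsScores = false;
        for (Mode mode : modes) {
            allSame &= (mode == first);
            anyNeedsScores |= mode.needsScores;
        }
        if (allSame) {
            return first;
        }
        return anyNeedsScores ? Mode.COMPLETE : Mode.COMPLETE_NO_SCORES;
    }

    public static void main(String[] args) {
        // Top hits by score plus facets: scores are needed.
        System.out.println(merge(List.of(Mode.TOP_SCORES, Mode.COMPLETE_NO_SCORES)));
        // Top hits by field plus facets: scores are not needed.
        System.out.println(merge(List.of(Mode.TOP_DOCS, Mode.COMPLETE_NO_SCORES)));
    }
}
```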
In the case when an index is sorted on a low-cardinality field, or the index
sort order correlates with the order in which documents get ingested, we can
optimize `SortedDocIDMerger` by doing a single comparison with the doc ID on
the next sub. This check covers at the same time whether the priority queue
needs reordering and whether the current sub reached `NO_MORE_DOCS`.
On the NYC taxis dataset on my local machine, switching from
`Arrays#compareUnsigned` to `ArrayUtil#getUnsignedComparator` yielded a 15%
speedup of BKD merging.
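
One plausible reason a length-specialized comparator helps is that fixed-width values can be compared as whole integers instead of byte by byte. The sketch below illustrates that idea only; the `ByteArrayComparator` interface and `unsignedComparator` factory are defined locally as assumptions and are not Lucene's `ArrayUtil` implementation. It uses `ByteBuffer` for readability, which a performance-oriented version would avoid because of the allocation.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class UnsignedCompareSketch {
    // Hypothetical interface for comparing fixed-width values inside byte arrays.
    interface ByteArrayComparator {
        int compare(byte[] a, int aOffset, byte[] b, int bOffset);
    }

    // Returns a comparator specialized for the given width. For 4 bytes, the
    // values are read as big-endian ints and compared as unsigned integers,
    // which orders them the same way as an unsigned byte-by-byte comparison.
    static ByteArrayComparator unsignedComparator(int numBytes) {
        if (numBytes == 4) {
            return (a, aOffset, b, bOffset) -> Integer.compareUnsigned(
                ByteBuffer.wrap(a, aOffset, 4).getInt(),
                ByteBuffer.wrap(b, bOffset, 4).getInt());
        }
        // Generic fallback for any other width.
        return (a, aOffset, b, bOffset) -> Arrays.compareUnsigned(
            a, aOffset, aOffset + numBytes, b, bOffset, bOffset + numBytes);
    }

    public static void main(String[] args) {
        byte[] a = {0, 0, 0, (byte) 0x80};
        byte[] b = {0, 0, 0, 0x7f};
        int specialized = unsignedComparator(4).compare(a, 0, b, 0);
        int generic = Arrays.compareUnsigned(a, 0, 4, b, 0, 4);
        // Both comparisons agree on the sign of the result.
        System.out.println(Integer.signum(specialized) == Integer.signum(generic));
    }
}
```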
When #672 was introduced, it added many nice rewrite optimizations. However, when there are many nested Boolean queries under a top-level Boolean#filter clause, its runtime grows exponentially.
The key issue was how BooleanQuery#rewriteNoScoring redirected yet again to ConstantScoreQuery#rewrite. This causes BooleanQuery#rewrite to be called again recursively, even though it was previously called in ConstantScoreQuery#rewrite, and THEN BooleanQuery#rewriteNoScoring is called again, recursively.
This causes exponential growth in rewrite time based on query depth. The change here hopes to short-circuit that and only grow (near) linearly by calling BooleanQuery#rewriteNoScoring directly, instead of attempting to redirect through ConstantScoreQuery#rewrite.
closes: #12069
Today Lucene allows creating indexed binary fields, e.g. via
`StringField(String, BytesRef, Field.Store)`, but not reusing them: calling
`setBytesValue` on a `StringField` throws.
This commit removes the check that prevents reusing fields with binary values.
I considered an alternative that consisted of failing if calling
`setBytesValue` on a field that is indexed and tokenized, but we currently
don't have such checks e.g. on numeric values, so it did not feel consistent.
Doing this change would help improve the [nightly benchmarks for the NYC taxis
dataset](http://people.apache.org/~mikemccand/lucenebench/sparseResults.html)
by doing the String -> UTF-8 conversion only once for keywords, instead of once
for the `StringField` and once for the `SortedDocValuesField`, while still
reusing fields.
`ConcurrentMergeScheduler` uses the rate at which a merge writes bytes as a
proxy for CPU usage, in order to prevent merging from disrupting searches too
much. However, creating compound files is lightweight CPU-wise and does not need
throttling.
Closes #12068