With this change, doc-value terms dictionaries use a shared `ByteBlockPool`
across all fields, and points, binary doc values and doc-value ordinals use
slightly smaller page sizes.
DWPTPool currently always returns the last DWPT that was added to the
pool. Returning the largest DWPT instead should produce larger flushes,
because it finishes DWPTs that are already close to full rather than
the most recently added one, which might be nearly empty.
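The selection policy described above can be sketched as follows. This
is only an illustration: `SketchWriter` and `LargestFirstPool` are
hypothetical names, not Lucene classes, and the real pool also has to
deal with locking and flush state that are left out here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a per-thread writer; it only tracks buffered bytes. */
final class SketchWriter {
  final String name;
  final long bytesUsed;

  SketchWriter(String name, long bytesUsed) {
    this.name = name;
    this.bytesUsed = bytesUsed;
  }
}

/** Hypothetical pool that hands out the free writer with the most buffered bytes. */
final class LargestFirstPool {
  private final List<SketchWriter> free = new ArrayList<>();

  /** Puts a writer back into the pool. */
  synchronized void release(SketchWriter writer) {
    free.add(writer);
  }

  /** Returns the largest free writer, or null if no writer is free. */
  synchronized SketchWriter checkout() {
    if (free.isEmpty()) {
      return null;
    }
    int best = 0;
    for (int i = 1; i < free.size(); i++) {
      if (free.get(i).bytesUsed > free.get(best).bytesUsed) {
        best = i;
      }
    }
    // Remove by index so the chosen writer is no longer visible to other threads.
    return free.remove(best);
  }
}
```

A last-in-first-out pool would instead return `free.remove(free.size() - 1)`,
regardless of how much each writer has buffered.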
When indexing wikimediumall, this change did not seem to improve the
indexing rate significantly, but it didn't slow things down either, and
the number of flushes went from 224-226 to 216, about 4% fewer.
My expectation is that our nightly benchmarks are a best-case scenario
for DWPTPool, as the same number of threads is dedicated to indexing over
time. When a single fixed threadpool is responsible for indexing into
several indices, however, the number of indexing threads that contribute
to a given index might vary greatly over time.
The newly added assertion in the bulk-merge logic doesn't always hold
because we do not create a new instance of
Lucene90CompressingTermVectorsReader for merges and that reader can be
accessed in tests (as long as it happens on the same thread).
This change clones a new term vectors reader for merges.
This change enables bulk-merge for term vectors with index sort. The
algorithm used here is similar to the one that is used to merge stored
fields.
Relates #134
This change adds new IndexSearcher and Collector implementations to profile
search execution and break down the timings. The breakdown includes the total
time spent in each of the following categories along with the number of times
visited: create weight, build scorer, next doc, advance, score, match.
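As a rough sketch of what such a breakdown might look like, the
accumulator below keeps a total time and a visit count per category.
`TimingType` and `TimingBreakdown` are hypothetical names chosen for
this example and are not necessarily the classes added by this change.

```java
import java.util.EnumMap;
import java.util.Map;

/** The categories listed in the description above. */
enum TimingType { CREATE_WEIGHT, BUILD_SCORER, NEXT_DOC, ADVANCE, SCORE, MATCH }

/** Hypothetical accumulator: total nanoseconds and visit count per category. */
final class TimingBreakdown {
  // Index 0 holds the total elapsed nanoseconds, index 1 the number of visits.
  private final Map<TimingType, long[]> stats = new EnumMap<>(TimingType.class);

  TimingBreakdown() {
    for (TimingType type : TimingType.values()) {
      stats.put(type, new long[2]);
    }
  }

  /** Records one visit of the given category that took elapsedNanos. */
  void record(TimingType type, long elapsedNanos) {
    long[] s = stats.get(type);
    s[0] += elapsedNanos;
    s[1]++;
  }

  long totalNanos(TimingType type) {
    return stats.get(type)[0];
  }

  long count(TimingType type) {
    return stats.get(type)[1];
  }
}
```

A profiling wrapper would typically call `System.nanoTime()` before and
after delegating to the wrapped scorer or weight, then pass the
difference to `record`.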
Co-authored-by: Julie Tibshirani <julietibs@gmail.com>
Previously, when creating a VectorWriter for merging, we would always
load the default implementation, so any parameters configured on the
format were ignored.
This issue was caught by `TestKnnGraph#testMergeProducesSameGraph`.
Previously, the max connections and beam width parameters could be configured as
field type attributes. This PR moves them to be parameters on
Lucene90HnswVectorFormat, to avoid exposing details of the vector format
implementation in the API.
Boosts are ignored on inner span queries, and top-level boosts can
be applied by using a normal BoostQuery, so SpanBoostQuery
itself is redundant and trappy. This commit removes it entirely.
Before, rewriting could slightly change the scoring when weights were
specified. We now rewrite less aggressively to avoid changing the query's
behavior.
If we fail to delete files that belong to a commit point, we expose
that deleted commit in subsequent calls to IndexDeletionPolicy#onCommit.
I think we should never expose those deleted commit points, as
some of their files might have been deleted already.
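The intended behavior can be illustrated with a small filter that hides
commits already marked as deleted, even if some of their files are
still on disk. `SketchCommit` and `CommitFilter` are hypothetical names
for this example and do not reflect the actual fix in the file deleter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical commit point: a segments file name plus a deleted flag. */
final class SketchCommit {
  final String segmentsFileName;
  boolean deleted;

  SketchCommit(String segmentsFileName) {
    this.segmentsFileName = segmentsFileName;
  }
}

final class CommitFilter {
  /**
   * Returns the commits that may be shown to a deletion policy. Commits
   * already marked deleted are dropped, whether or not removing their
   * files succeeded.
   */
  static List<SketchCommit> visibleCommits(List<SketchCommit> all) {
    List<SketchCommit> visible = new ArrayList<>();
    for (SketchCommit commit : all) {
      if (!commit.deleted) {
        visible.add(commit);
      }
    }
    return visible;
  }
}
```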
TestMatchesIterator lives in core/tests and does various sanity checks
on the matches returned by various queries, including Span queries.
The Span-specific tests cannot stay here once Spans have been moved
out of core. This commit pulls various helper methods from this class
into a base class in the test framework, so that we can move the
Spans tests into their own class and keep coverage once things have
been migrated.
We have a number of helper classes in o.a.l.search that aid the
implementation of two-phase iteration over disjunctions. These have
some Spans-specific code, which will stop compiling once Spans
are moved into the queries module. This commit removes the
Spans references from the main code and duplicates the helper
code within the Spans package.