Sort is used in many places where we assume that it is immutable
(for example, in IndexWriterConfig). This commit makes it immutable and also
updates the severely outdated javadoc.
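As a language-neutral illustration of the idea (not Lucene's actual `Sort` class), here is a minimal sketch of an immutable sort specification. The names `SortField`, `Sort`, and `fields` are hypothetical stand-ins; the point is the defensive copy into an immutable container.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SortField:
    """Hypothetical stand-in for a single sort criterion."""
    field: str
    reverse: bool = False


class Sort:
    """Hypothetical immutable sort: criteria are copied once and never exposed for mutation."""

    def __init__(self, *fields: SortField):
        if not fields:
            raise ValueError("at least one sort field is required")
        # Copy into a tuple so later changes to the caller's list cannot leak in.
        self._fields: Tuple[SortField, ...] = tuple(fields)

    @property
    def fields(self) -> Tuple[SortField, ...]:
        # Tuples are immutable, so returning the internal value is safe.
        return self._fields


criteria = [SortField("price"), SortField("date", reverse=True)]
sort = Sort(*criteria)
criteria.append(SortField("title"))  # does not affect `sort`
```

Because the object cannot change after construction, it can be shared between configuration objects without copying.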
`dvGen` doesn't need to be checked for schema consistency since it is always
-1. Furthermore, the `assertSame` overload that took an object now takes an
enum instead, since it relies on instance equality checks, which are generally
incorrect for objects.
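The distinction between instance equality and value equality can be sketched as follows. This is an illustrative Python analogue, not the Lucene code; `IndexOptions` and `assert_same_enum` are hypothetical names. Identity checks are safe for enum members because each member exists exactly once, while two equal ordinary objects can still be distinct instances.

```python
from enum import Enum


class IndexOptions(Enum):
    """Hypothetical enum; each member is a singleton."""
    NONE = 0
    DOCS = 1


def assert_same_enum(expected: Enum, actual: Enum) -> None:
    # Identity comparison is safe for enum members: each one exists exactly once.
    if expected is not actual:
        raise AssertionError(f"expected {expected}, got {actual}")


# Two equal but distinct objects: equality and identity disagree.
a = [1, 2, 3]
b = [1, 2, 3]
same_value = a == b      # value equality
same_instance = a is b   # instance equality

# Looking a member up by value returns the same singleton, so this passes.
assert_same_enum(IndexOptions.DOCS, IndexOptions(1))
```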
* Update wording in README and poll-mirrors.py
* First pass at updating wizard
- lucene/solr -> lucene
- removed solr-only tasks and python functions
* Update addVersion to remove Solr parts
- fixes bug with a regex and missing String qualifier for gradle baseVersion
* buildAndPushRelease - remove solr parts
* githubPRs.py report on PRs from new lucene repo and lucene JIRA only
* update smokeTestRelease.py example in README.md (but not smokeTestRelease.py itself)
* remove Solr references in releasedJirasRegex.py
* Update releasedJirasRegex.py
* Add gpg release signing to buildAndPushRelease.py
Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
Better document these methods directly, mentioning endianness, linking
to the appropriate VarHandle constant, etc.
Add a blurb to MIGRATE.txt to call out the switch to little-endian to
increase awareness.
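To make the practical impact of byte order concrete, here is a small sketch using Python's `struct` module (unrelated to Lucene's own I/O classes). It shows the same 32-bit integer encoded in both orders, and what happens when little-endian bytes are read as big-endian.

```python
import struct

value = 0x01020304

little = struct.pack("<i", value)  # least significant byte first
big = struct.pack(">i", value)     # most significant byte first

# Reading bytes with the wrong byte order silently yields a different number.
misread = struct.unpack(">i", little)[0]
```

No error is raised on the mismatched read, which is why a switch of byte order deserves an explicit migration note.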
Expand the log message when CMS.MergeThread completes its merge operation
to include additional useful diagnostic information, such as the total bytes
written, the time taken, and rate limiter information. Also, while here, unify
the thread start and end log output to help improve tracing.
The gradle plugin portal uses jcenter to resolve third-party plugins, which
can be flaky. This commit instructs gradle to look first in maven central,
and only use the plugin portal for gradle's own plugins.
This commit adds a new `addDiagnostics` method to `SegmentInfo` that
allows custom merge policies to add new diagnostic information to the
segment's diagnostic map.
There was a regression introduced in
https://github.com/apache/lucene/pull/107/files#diff-49b11ced76acedf749c5a5a0ff6e7fe93b8fb64caf8697e487a56f4f7adbb510
where we moved from write logic that was optimized for every number of bits per
value to more general logic that had to work for every number of bits per value.
This PR doesn't restore as much specialization, but finds a middle ground that
makes flushes and merges of doc values noticeably (though not dramatically)
faster.
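The trade-off between general and specialized write logic can be sketched as below. This is not the Lucene implementation. It assumes, purely for illustration, that the specialized path covers widths that are a multiple of 8 bits; the widths actually specialized in the PR are not stated here. All function names are hypothetical.

```python
def pack_generic(values, bits_per_value):
    """Pack non-negative ints into bytes, low bits first; works for any width."""
    acc = 0       # bit accumulator
    acc_bits = 0  # number of valid bits currently held in acc
    out = bytearray()
    for v in values:
        assert 0 <= v < (1 << bits_per_value)
        acc |= v << acc_bits
        acc_bits += bits_per_value
        while acc_bits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            acc_bits -= 8
    if acc_bits:
        out.append(acc & 0xFF)  # flush the final partial byte
    return bytes(out)


def pack_byte_aligned(values, bits_per_value):
    """Specialized path (assumed): byte-aligned widths need no bit shuffling."""
    assert bits_per_value % 8 == 0
    width = bits_per_value // 8
    return b"".join(v.to_bytes(width, "little") for v in values)


def pack(values, bits_per_value):
    # Dispatch to the cheap path when possible, otherwise fall back to the general one.
    if bits_per_value % 8 == 0:
        return pack_byte_aligned(values, bits_per_value)
    return pack_generic(values, bits_per_value)
```

Both paths produce the same bytes for byte-aligned widths, so the specialization changes only the cost of writing, not the on-disk layout in this sketch.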
Many tests are failing due to the newly introduced chunk scoring in
AssertingBulkScorer. This commit reverts that change and will
reintroduce it later.
Even though it was not the driver for the slowdown, in LUCENE-10125 we
identified that the move to PFOR had slowed down indexing significantly
for fields indexed with indexOptions=DOCS. This patch gets some of the
performance back by using the `LongHeap` that we introduced for vectors
instead of sorting the same array over and over again.
On the NYC Taxis benchmark, I observed ~8% faster merges of postings
with this change.
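The general technique of replacing repeated sorting with a bounded heap can be sketched as follows. This uses Python's `heapq` rather than Lucene's `LongHeap`, and it assumes the goal is to find the few largest values in a block, which is an assumption about the use case rather than something the message states.

```python
import heapq


def top_k_by_sorting(values, k):
    """Baseline: sorts a full copy of the block, O(n log n) per call."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return sorted(values)[-k:]


def top_k_with_heap(values, k):
    """Bounded min-heap: O(n log k), and only k values are ever held."""
    if k < 1:
        raise ValueError("k must be at least 1")
    heap = list(values[:k])
    heapq.heapify(heap)
    for v in values[k:]:
        # heap[0] is the smallest of the k largest values seen so far.
        if v > heap[0]:
            heapq.heapreplace(heap, v)
    return sorted(heap)


block = [5, 1, 9, 3, 7, 9, 2, 8]
largest = top_k_with_heap(block, 3)
```

When k is small relative to the block size, most values fail the `v > heap[0]` check and cost a single comparison.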
It seems that the VectorFormat merge creates a lot of these bitsets. We don't
need to do any fancy reflection here via shallowSizeOf(Object) when we
can call sizeOf(long[]), which is fast.
We may want to revisit this RamUsageEstimator API in the future to
prevent traps like this.
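The reason a dedicated array overload can be fast is that the size of a primitive array follows from simple arithmetic. A minimal sketch, assuming a 16-byte array header and 8-byte object alignment (typical for a 64-bit JVM with compressed references, but an assumption here, not a value taken from the source):

```python
ARRAY_HEADER_BYTES = 16  # assumption: typical 64-bit JVM with compressed references
OBJECT_ALIGNMENT = 8     # assumption: default JVM object alignment
LONG_BYTES = 8


def align(size: int) -> int:
    """Round size up to the next multiple of OBJECT_ALIGNMENT."""
    return (size + OBJECT_ALIGNMENT - 1) // OBJECT_ALIGNMENT * OBJECT_ALIGNMENT


def size_of_long_array(length: int) -> int:
    """Estimated heap bytes for a long[] of the given length, by arithmetic only."""
    return align(ARRAY_HEADER_BYTES + LONG_BYTES * length)
```

No field inspection is involved, so the cost is constant regardless of how many arrays are measured.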
Lucene90HnswVectorsFormat has a default 'beam width' of 16. This is quite low
and produces poor recall on typical-sized datasets.
This commit bumps it to 100. This new default tries to balance good search
performance with indexing speed. Most runs in ann-benchmarks set the parameter
between ~400 and 800, but they heavily optimize for search over indexing speed.
We should not set single sort when search_after is non-null;
otherwise, we will incorrectly skip documents whose values are equal to
the search_after value and whose docIDs are greater than the search_after
docID.
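The failure mode can be reproduced with a toy model. This is a simplified sketch, not the collector code: `collect_after` and its `skip_ties` flag are hypothetical. Here `skip_ties=True` models the shortcut that is only valid when no search_after document is involved.

```python
def collect_after(docs, after_value, after_doc, skip_ties):
    """docs: list of (doc_id, value) pairs, sorted by (value, doc_id) ascending."""
    hits = []
    for doc_id, value in docs:
        if skip_ties:
            # Buggy shortcut: treats every doc with value <= after_value as already returned.
            competitive = value > after_value
        else:
            # Correct: ties on the value are broken by doc ID.
            competitive = (value, doc_id) > (after_value, after_doc)
        if competitive:
            hits.append(doc_id)
    return hits


docs = [(0, 10), (1, 20), (2, 20), (3, 20), (4, 30)]
# The previous page ended at doc 1, whose value is 20.
correct = collect_after(docs, 20, 1, skip_ties=False)
buggy = collect_after(docs, 20, 1, skip_ties=True)
```

Docs 2 and 3 share the value 20 with the search_after document but come after it in doc ID order, so only the tie-aware comparison returns them.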
This commit adds the QueryParserBase::getFuzzyDistance protected method, which
can be overridden by subclasses to provide customisation of how the similarity distance
is determined. The default implementation retains the current behaviour.
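The extension point follows the usual overridable-hook pattern, sketched below in Python. This is not the Lucene class: the method signature, the default distance of 2.0, and the `ShortTermParser` subclass are all hypothetical and chosen only to show how a subclass can customise the distance while the default behaviour stays unchanged.

```python
class QueryParserBase:
    """Hypothetical, heavily simplified parser: only the fuzzy-distance hook is modelled."""

    DEFAULT_FUZZY_DISTANCE = 2.0  # assumption: used when no distance is given

    def get_fuzzy_distance(self, fuzzy_suffix: str, term: str) -> float:
        # Default behaviour: parse the number after '~', else fall back to the default.
        try:
            return float(fuzzy_suffix[1:])
        except ValueError:
            return self.DEFAULT_FUZZY_DISTANCE

    def parse_fuzzy(self, token: str):
        term, sep, rest = token.partition("~")
        return term, self.get_fuzzy_distance(sep + rest, term)


class ShortTermParser(QueryParserBase):
    def get_fuzzy_distance(self, fuzzy_suffix: str, term: str) -> float:
        # Custom policy: allow at most one edit for very short terms.
        distance = super().get_fuzzy_distance(fuzzy_suffix, term)
        return min(distance, 1.0) if len(term) <= 3 else distance
```

For example, `ShortTermParser().parse_fuzzy("cat~2")` caps the distance at 1.0, while the base class keeps the requested 2.0.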
The test fails randomly because HNSW can sometimes miss results when k is close
to the number of total docs. While we wait for a fix, this commit decreases k to
prevent failures.