* Build Lucene binary distribution using Gradle
* Generate SHA-512 checksums for all release artifacts
* Update documentation artifacts included in binaries
* Delete some additional Ant relics
Co-authored-by: Dawid Weiss <dawid.weiss@carrotsearch.com>
Co-authored-by: Uwe Schindler <uschindler@apache.org>
PR #1351 introduced a sort optimization where documents can be skipped.
But iteration over competitive iterators was not properly organized:
they did not store the current doc ID, so when a competitive iterator
was updated, the current doc ID was lost.
This patch fixes it.
Relates to #1351
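A minimal sketch of the fix idea, with hypothetical names (Lucene's actual classes differ): the wrapper remembers the current doc ID so that swapping in an updated competitive iterator does not lose the position.

    import java.io.IOException;
    import org.apache.lucene.search.DocIdSetIterator;

    // Hypothetical wrapper; remembers the current doc across updates.
    final class CompetitiveIteratorSketch extends DocIdSetIterator {
      private DocIdSetIterator in; // the current competitive iterator
      private int doc = -1;        // remembered current doc ID

      CompetitiveIteratorSketch(DocIdSetIterator in) { this.in = in; }

      // Called when the comparator narrows the competitive set: the new
      // iterator is caught up to the remembered doc instead of restarting.
      void update(DocIdSetIterator newIterator) throws IOException {
        in = newIterator;
        if (doc >= 0 && in.docID() < doc) {
          in.advance(doc);
        }
      }

      @Override public int docID() { return doc; }

      @Override public int nextDoc() throws IOException {
        // update() may already have moved the underlying iterator past doc
        return doc = (in.docID() > doc) ? in.docID() : in.nextDoc();
      }

      @Override public int advance(int target) throws IOException {
        return doc = (in.docID() >= target) ? in.docID() : in.advance(target);
      }

      @Override public long cost() { return in.cost(); }
    }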
PR #1351 introduced a sort optimization where documents can be skipped.
But there was a bug when a two-phase approximation was used: we would
advance the approximation without advancing the overall conjunction
iterator.
This patch fixes it.
Relates to #1351
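A hedged sketch of the rule the fix enforces (names are illustrative, not Lucene's internals): non-competitive docs must be skipped through the conjunction's shared approximation, never by advancing the two-phase's inner approximation alone.

    import java.io.IOException;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.search.TwoPhaseIterator;

    final class ConjunctionAdvanceSketch {
      // Advance via the iterator shared by all clauses so that every clause
      // stays in sync, then confirm candidates with the two-phase check.
      static int advance(DocIdSetIterator conjunctionApproximation,
                         TwoPhaseIterator twoPhase,
                         int target) throws IOException {
        int doc = conjunctionApproximation.advance(target);
        while (doc != DocIdSetIterator.NO_MORE_DOCS
            && twoPhase.matches() == false) {
          doc = conjunctionApproximation.nextDoc();
        }
        return doc;
      }
    }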
Some applications use pendingNumDocs from IndexWriter to detect that the
number of documents in an index is very close to the hard limit, so that
they can reject writes without constructing Lucene documents.
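For illustration, the kind of early rejection this enables (a sketch, not an API recommendation):

    import org.apache.lucene.index.IndexWriter;

    final class WriteAdmission {
      // Reject a batch up front instead of building Lucene documents that
      // would only fail against the hard limit later.
      static boolean canAccept(IndexWriter writer, int batchSize) {
        return writer.getPendingNumDocs() + batchSize <= IndexWriter.MAX_DOCS;
      }
    }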
This can happen with catenateAll when only one of word parts and number
parts is generated and the input ends with a sub-token of the
non-generated kind.
Fuzzing revealed that only start & end offsets are needed to order sub-tokens.
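A configuration that can hit this code path, as a sketch (the flag choice is illustrative):

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;

    final class CatenateAllSketch {
      // With word parts generated but number parts not, an input such as
      // "abc-123" ends with a sub-token (the number) that is only ever
      // emitted as part of the catenated token "abc123".
      static TokenStream wrap(TokenStream in) {
        int flags = WordDelimiterGraphFilter.GENERATE_WORD_PARTS
            | WordDelimiterGraphFilter.CATENATE_ALL;
        return new WordDelimiterGraphFilter(in, flags, null);
      }
    }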
LUCENE-9444: add utility class to retrieve facet labels from the taxonomy index for a facet field so such fields do not also have to be redundantly stored in the index.
Co-authored-by: Ankur Goel <goankur@amazon.com>
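The underlying idea, sketched against the stable taxonomy API (the new utility class's exact surface is not shown here): a facet ordinal can be resolved back to its label, so the label need not be stored redundantly.

    import java.io.IOException;
    import org.apache.lucene.facet.taxonomy.FacetLabel;
    import org.apache.lucene.facet.taxonomy.TaxonomyReader;

    final class FacetLabelSketch {
      // Resolve an ordinal recorded for a document back to its label.
      static FacetLabel labelOf(TaxonomyReader taxoReader, int ordinal)
          throws IOException {
        return taxoReader.getPath(ordinal);
      }
    }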
Currently we calculate the ramBytesUsed of the DWPT under the flushControl
lock. We can safely do this calculation outside of the lock without any
downside. The flushControl lock should be used with care since it's a
central part of indexing and might block all indexing.
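A generic illustration of the change (names hypothetical, not the actual FlushControl code): compute outside the lock, publish under it.

    final class LockScopeSketch {
      private final Object flushControlLock = new Object();
      private long activeBytes;

      interface MemoryAccountable { long ramBytesUsed(); }

      void updateAccounting(MemoryAccountable perThread) {
        long bytes = perThread.ramBytesUsed(); // pure read: no lock needed
        synchronized (flushControlLock) {
          activeBytes += bytes;                // shared state: lock needed
        }
      }
    }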
SortingCodecReader keeps all doc values loaded from it in memory. Yet this
reader should only be used for merging, which happens sequentially. This
makes caching doc values unnecessary.
Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
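For context, the intended merge-only usage pattern, sketched:

    import java.io.IOException;
    import org.apache.lucene.index.CodecReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.SortingCodecReader;
    import org.apache.lucene.search.Sort;

    final class SortedAddIndexes {
      // addIndexes consumes each doc-values field once and sequentially,
      // so there is nothing to gain from caching them in the reader.
      static void addSorted(IndexWriter writer, CodecReader unsorted, Sort sort)
          throws IOException {
        writer.addIndexes(SortingCodecReader.wrap(unsorted, sort));
      }
    }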
The problem with tracking dirtiness via the number of chunks is that
larger chunks make stored fields readers more likely to be considered
dirty, so I'm trying to work around it by tracking the number of docs
instead.
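A sketch of the changed heuristic (the field and threshold are hypothetical):

    final class DirtinessSketch {
      long numDirtyDocs; // docs in partially filled ("dirty") chunks

      // Counting docs rather than chunks keeps the signal proportional to
      // the actual amount of poorly compressed data, whatever the chunk size.
      boolean tooDirty(long numDocs) {
        return numDirtyDocs * 100 > numDocs; // hypothetical 1% threshold
      }
    }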
The increase of the maximum number of docs per chunk done in previous
issues was mostly arbitrary. I'd like to provide users with a trade-off
similar to what the old versions of BEST_SPEED and BEST_COMPRESSION used
to offer. Since BEST_SPEED used to compress at most 128 docs at once, we
should roughly make it 128*10 now that there are 10 sub blocks. I made it
1024 to account for the fact that there is a preset dict as well that
needs decompressing. Similarly, BEST_COMPRESSION used to allow 4x more
docs than BEST_SPEED, so I made it 4096.
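The arithmetic, spelled out (constant names are illustrative):

    final class ChunkLimits {
      // BEST_SPEED: 128 docs/chunk before * 10 sub blocks = 1280, rounded
      // down to 1024 to leave headroom for decompressing the preset dict.
      static final int BEST_SPEED_MAX_DOCS_PER_CHUNK = 1024;
      // BEST_COMPRESSION: 4x BEST_SPEED, as in the old formats.
      static final int BEST_COMPRESSION_MAX_DOCS_PER_CHUNK = 4 * 1024; // 4096
    }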
With such larger numbers of docs per chunk, the decoding of metadata
became a bottleneck for stored field access, so I made it a bit faster
by doing bulk decoding of the packed longs.
Instead of configuring a dictionary size and a block size, the format
now tries to have 10 sub blocks per bigger block, and adapts the sizes
of the dictionary and of the sub blocks to this overall block size.
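A hypothetical sketch of the sizing rule (the real format's exact ratios may differ):

    final class BlockSizingSketch {
      static final int NUM_SUB_BLOCKS = 10;

      // Assumption for illustration: the dictionary is sized like one more
      // sub block, and the remainder is split into 10 equal sub blocks.
      static int dictLength(int blockLength) {
        return blockLength / (NUM_SUB_BLOCKS + 1);
      }

      static int subBlockLength(int blockLength) {
        return (blockLength - dictLength(blockLength)) / NUM_SUB_BLOCKS;
      }
    }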
When index sorting is enabled, stored fields and term vectors can't be
written on the fly as in the normal case, so they are written to
temporary files that are subsequently sorted. For these temporary files,
disabling compression speeds up indexing significantly.
On a synthetic test that indexes stored fields and a doc value field
populated with random values that is used for index sorting, this
resulted in a 3x indexing speedup.
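The benchmark setup described above looks roughly like this (the field name is hypothetical):

    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;

    final class SortedIndexConfig {
      // Enabling an index sort routes stored fields and term vectors through
      // the temporary files whose compression this change disables.
      static IndexWriterConfig newConfig() {
        IndexWriterConfig config = new IndexWriterConfig();
        config.setIndexSort(
            new Sort(new SortField("sort_key", SortField.Type.LONG)));
        return config;
      }
    }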
* Restore lucene/version.properties
* Switch release wizard commands from ant to gradle equivalents
* Remove remaining checks for ant
* Remove checks for Java 8
* Update copyright year
* Minor bug fixes around determining next version for a major release