This commit also adds more checks about the values of `numChunks`,
`numDirtyChunks` and `numDirtyDocs` that would have helped discover this issue
earlier.
Add a simple regeneration help doc
Improve task help and checksum failure message (include corresponding regeneration task). Sorry for being verbose. Maybe somebody will read it. :)
Co-authored-by: Dawid Weiss <dawid.weiss@carrotsearch.com>
This adds a bit of simplicity as the file is a simple domain list,
rather than a DNS zone. So the regexes parsing DNS can be removed.
Also the file may change less often as it contains JUST the list of
TLDs, and not any additional DNS metadata.
This makes regenerate idempotent by removing the new Date() from the
output.
We already have the root.zone's Last-Modified date, which is the one
that matters and only changes when the root.zone changes.
Fails the linter if an exception is swallowed (e.g. variable completely
unused).
If this is intentional for some reason, the exception can simply by
annotated with @SuppressWarnings("unused").
This enables quite a few javac warnings from java11+ that weren't
enabled for some reason. None of them fail, so lock them in.
Additionally some newer checks are only recognized for newer JDK
versions, so they are only enabled based on the javac version used. They
will cause no annoyance because they relate to newer language features.
Require that the segment has enough dirty documents to create a clean
chunk before recompressing during merge, there must be at least maxChunkSize.
This prevents wasteful recompression with small flushes (e.g. every
document): we ensure recompression achieves some "permanent" progress.
Expose maxDocsPerChunk as a parameter for Term vectors too, matching the
stored fields format. This allows for easy testing.
Increment numDirtyDocs for partially optimized merges:
If segment N needs recompression, we have to flush any buffered docs
before bulk-copying segment N+1. Don't just increment numDirtyChunks,
also make sure numDirtyDocs is incremented, too.
This doesn't have a performance impact, and is unrelated to tooDirty()
improvements, but it is easier to reason about things with correct
statistics in the index.
Further tuning of how dirtiness is measured: for simplification just use percentage
of dirty chunks.
Co-authored-by: Adrien Grand <jpountz@gmail.com>