LUCENE-9334 requires that docs have the same schema
across the whole schema.
This fixes the test that attempts to modify schema of "number" field
from DocValues and Points to just DocValues.
Relates to #11
Require consistency between data-structures on a per-field basis
A field must be indexed with the same index options and data-structures across
all documents. Thus, for example, it is not allowed to have one document
where a certain field is indexed with doc values and points, and another document
where the same field is indexed only with points.
But it is allowed for a document not to have a certain field at all.
As a consequence of this, doc values updates are
only applicable for fields that are indexed with doc values only.
This commit removes `ramBytesUsed()` from `CodecReader` and all file formats
besides vectors, which is the only remaining file format that might use lots of
memory in the default codec. I left `ramBytesUsed()` on the `completion` format
too, which is another feature that could use lots of memory.
Other components that relied on being able to compute memory usage of readers
like facets' TaxonomyReader and the analyzing suggester assume that readers have
a RAM usage of 0 now.
This commit also adds more checks about the values of `numChunks`,
`numDirtyChunks` and `numDirtyDocs` that would have helped discover this issue
earlier.
Add a simple regeneration help doc
Improve task help and checksum failure message (include corresponding regeneration task). Sorry for being verbose. Maybe somebody will read it. :)
Co-authored-by: Dawid Weiss <dawid.weiss@carrotsearch.com>
This adds a bit of simplicity as the file is a simple domain list,
rather than a DNS zone. So the regexes parsing DNS can be removed.
Also the file may change less often as it contains JUST the list of
TLDs, and not any additional DNS metadata.
This makes regenerate idempotent by removing the new Date() from the
output.
We already have the root.zone's Last-Modified date, which is the one
that matters and only changes when the root.zone changes.
Fails the linter if an exception is swallowed (e.g. variable completely
unused).
If this is intentional for some reason, the exception can simply by
annotated with @SuppressWarnings("unused").
This enables quite a few javac warnings from java11+ that weren't
enabled for some reason. None of them fail, so lock them in.
Additionally some newer checks are only recognized for newer JDK
versions, so they are only enabled based on the javac version used. They
will cause no annoyance because they relate to newer language features.
Require that the segment has enough dirty documents to create a clean
chunk before recompressing during merge, there must be at least maxChunkSize.
This prevents wasteful recompression with small flushes (e.g. every
document): we ensure recompression achieves some "permanent" progress.
Expose maxDocsPerChunk as a parameter for Term vectors too, matching the
stored fields format. This allows for easy testing.
Increment numDirtyDocs for partially optimized merges:
If segment N needs recompression, we have to flush any buffered docs
before bulk-copying segment N+1. Don't just increment numDirtyChunks,
also make sure numDirtyDocs is incremented, too.
This doesn't have a performance impact, and is unrelated to tooDirty()
improvements, but it is easier to reason about things with correct
statistics in the index.
Further tuning of how dirtiness is measured: for simplification just use percentage
of dirty chunks.
Co-authored-by: Adrien Grand <jpountz@gmail.com>