IndexWriter#numDeletesToMerge was creating a ReadersAndUpdates
for all incoming SegmentCommitInfo even if that info wasn't private
to the IndexWriter. This is an illegal use of this API but since it's
transitively public via MergePolicy#findMerges we have to be conservative
with regestiering ReadersAndUpdates. In IndexWriter#numDeletesToMerge we
can only use existing ones. This means for soft-deletes we need to react
earlier in order to produce accurate numbers.
This change partially rolls back the changes in LUCENE-8253. Instead of
registering the readers once they are pulled via IndexWriter#numDeletesToMerge
we now check if segments are fully deleted on flush which is very unlikely and
can be done in a lazy fashion ie. it's only paying the extra cost of opening a
reader and checking all soft-deletes if soft deletes are used and present
in the flushed segment.
This has the side-effect that flushed segments that are 100% hard deleted are also
cleaned up right after they are flushed, previously these segments were sticking
around for a while until they got picked for a merge or received another delete.
This also closes LUCENE-8256
Inside the IndexWriter buffers are only written to disk if it's needed
or "worth it" which doesn't guarantee soft deletes to be accounted
in time. This is not necessarily a problem since they are eventually
collected and segments that have soft-deletes will me merged eventually
but for tests and on par behavior compared to hard deletes this behavior
is tricky.
This change cuts over to accounting in-place just like hard-deletes. This
results in accurate delete numbers for soft deletes at any give point in time
once the reader is loaded or a pending soft delete occurs.
This change also fixes an issue where all updates to a DV field are allowed
event if the field is unknown. Now this only works if the field is equal
to the soft deletes field. This behavior was never released.
This change adds a korean analyzer in a new analysis module named nori. It is similar
to Kuromoji but uses the mecab-ko-dic dictionary to perform morphological analysis of Korean
text.
We drop changes after we finish a merge, this has also reset
the DV generation the PendingSoftDeletes were initialized on causing
assertions to trip if releaseing the reader was writing DV to disk.
This change removes resetting the dv generation to make assertions
hold which requried to keep the pending change count on PendingSoftDeletes.