Today IndexWriter and some of its associated classes look like the wild
west. This change makes sure all non-final members have private
visibility, and methods that are not used outside of IW today are made
private unless they were already public. This change also removes some
unused or unnecessary members where possible and deletes some dead
code from previous refactorings.
Today we still have one class that runs some tricky logic that should
be in the IndexWriter in the first place since it requires locking on
the IndexWriter itself. This change inverts the API and now FrozenBufferedUpdates
does not get the IndexWriter passed in; instead the IndexWriter owns most of the logic
and executes on a FrozenBufferedUpdates object. This prevents locking on IndexWriter
outside of the writer itself and paves the way to simplify some concurrency down the road.
This change extracts the methods that are used by MergeScheduler into
a MergeSource interface. This allows IndexWriter to better ensure
locking and hide internal methods, and removes the tight coupling between
the two complex classes. This will also improve future testing.
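As a rough illustration of this kind of decoupling, the sketch below uses simplified stand-ins. `ToyMergeSource`, `ToyWriter` and `ToyScheduler` are hypothetical names and not the Lucene classes, and merges are modelled as plain strings. The point is only that the scheduler sees a narrow interface while all locking stays inside the writer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class MergeSourceSketch {
  /** Hypothetical narrow view of the writer that a scheduler is allowed to see. */
  interface ToyMergeSource {
    String getNextMerge(); // returns null when nothing is pending
    boolean hasPendingMerges();
    void merge(String merge);
  }

  /** Hypothetical writer: every lock on the writer is taken inside this class. */
  static final class ToyWriter {
    private final Queue<String> pending = new ArrayDeque<>();
    private final List<String> completed = new ArrayList<>();

    synchronized void registerMerge(String merge) {
      pending.add(merge);
    }

    synchronized List<String> completedMerges() {
      return new ArrayList<>(completed);
    }

    /** Hands out the restricted view; the scheduler never gets the writer itself. */
    ToyMergeSource mergeSource() {
      return new ToyMergeSource() {
        @Override
        public String getNextMerge() {
          synchronized (ToyWriter.this) {
            return pending.poll();
          }
        }

        @Override
        public boolean hasPendingMerges() {
          synchronized (ToyWriter.this) {
            return !pending.isEmpty();
          }
        }

        @Override
        public void merge(String merge) {
          synchronized (ToyWriter.this) {
            completed.add(merge);
          }
        }
      };
    }
  }

  /** Hypothetical scheduler that only depends on the interface. */
  static final class ToyScheduler {
    void merge(ToyMergeSource source) {
      String next;
      while ((next = source.getNextMerge()) != null) {
        source.merge(next);
      }
    }
  }

  static List<String> runDemo() {
    ToyWriter writer = new ToyWriter();
    writer.registerMerge("_0+_1");
    writer.registerMerge("_2+_3");
    new ToyScheduler().merge(writer.mergeSource());
    return writer.completedMerges();
  }

  public static void main(String[] args) {
    System.out.println(runDemo());
  }
}
```

A test can hand the scheduler a fake `ToyMergeSource` without constructing a writer at all, which is one way such an interface may help testing.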
This method is trappy; it doesn't work for all SortField types, but doesn't tell
you that until runtime. This commit deprecates it, and removes all other
callsites in the codebase.
Replaces SimpleBindings' Map<String, Object> with a map of
Function<Bindings, DoubleValuesSource> to improve type safety, and
reworks cycle detection and validation to avoid catching
StackOverflowError.
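A minimal sketch of the idea, assuming a much simplified model: values are plain doubles instead of DoubleValuesSource, and `ToyBindings` / `ToySimpleBindings` are hypothetical names. It shows a map of functions that resolve against the bindings, with cycles detected by tracking the names currently being resolved rather than by waiting for the stack to overflow. It is not thread-safe and is not the actual SimpleBindings code.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class BindingsSketch {
  /** Hypothetical stand-in for a bindings lookup. */
  interface ToyBindings {
    double get(String name);
  }

  static final class ToySimpleBindings implements ToyBindings {
    // Typed map: every entry is a function of the bindings, never a raw Object.
    private final Map<String, Function<ToyBindings, Double>> map = new HashMap<>();
    // Names currently being resolved on this call stack (assumes single-threaded use).
    private final Set<String> inProgress = new LinkedHashSet<>();

    void add(String name, Function<ToyBindings, Double> source) {
      map.put(name, source);
    }

    @Override
    public double get(String name) {
      Function<ToyBindings, Double> source = map.get(name);
      if (source == null) {
        throw new IllegalArgumentException("Invalid reference '" + name + "'");
      }
      if (!inProgress.add(name)) {
        // The name is already on the resolution path, so this is a cycle.
        throw new IllegalArgumentException("Cycle detected: " + inProgress + " -> " + name);
      }
      try {
        return source.apply(this);
      } finally {
        inProgress.remove(name);
      }
    }
  }

  static double demoValue() {
    ToySimpleBindings bindings = new ToySimpleBindings();
    bindings.add("a", b -> 2.0);
    bindings.add("b", b -> b.get("a") * 3);
    return bindings.get("b");
  }

  static boolean cycleIsRejected() {
    ToySimpleBindings bindings = new ToySimpleBindings();
    bindings.add("x", b -> b.get("y"));
    bindings.add("y", b -> b.get("x"));
    try {
      bindings.get("x");
      return false;
    } catch (IllegalArgumentException e) {
      return e.getMessage().startsWith("Cycle detected");
    }
  }

  public static void main(String[] args) {
    System.out.println(demoValue());
    System.out.println(cycleIsRejected());
  }
}
```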
This test produced tons of files on nightly builds, causing
TooManyOpenFilesExceptions likely due to not using CFS on flush
and/or very small maxMergeSize values.
IW#maybeMerge calls the MergeScheduler even if it didn't find any merges. We should instead only do this if there is in fact anything to merge, and save the call into a sync'd method.
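The guard can be pictured roughly as follows. This is a sketch with hypothetical names (`ToyWriter`, `CountingScheduler`, and a segment count standing in for the merge policy), not the IndexWriter code: the writer first checks whether any merge was registered and only then enters the scheduler's synchronized method.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class MaybeMergeSketch {
  /** Hypothetical scheduler whose entry point is synchronized and counts its calls. */
  static final class CountingScheduler {
    private int calls;

    synchronized void merge(ToyWriter writer) {
      calls++;
      writer.drainPendingMerges();
    }

    synchronized int calls() {
      return calls;
    }
  }

  /** Hypothetical writer; the segment count stands in for a real merge policy. */
  static final class ToyWriter {
    private final Queue<String> pendingMerges = new ArrayDeque<>();
    private final CountingScheduler scheduler;

    ToyWriter(CountingScheduler scheduler) {
      this.scheduler = scheduler;
    }

    /** Registers a merge if there is something to merge and reports whether it did. */
    private synchronized boolean updatePendingMerges(int segmentCount) {
      if (segmentCount < 2) {
        return false;
      }
      pendingMerges.add("merge of " + segmentCount + " segments");
      return true;
    }

    synchronized void drainPendingMerges() {
      pendingMerges.clear();
    }

    void maybeMerge(int segmentCount) {
      // Only pay for the synchronized scheduler call when a merge was found.
      if (updatePendingMerges(segmentCount)) {
        scheduler.merge(this);
      }
    }
  }

  static int runDemo() {
    CountingScheduler scheduler = new CountingScheduler();
    ToyWriter writer = new ToyWriter(scheduler);
    writer.maybeMerge(1); // nothing to merge, scheduler is not called
    writer.maybeMerge(5); // one merge registered, scheduler is called once
    return scheduler.calls();
  }

  public static void main(String[] args) {
    System.out.println(runDemo());
  }
}
```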
CMS today releases its lock after finishing a merge before it re-acquires it to update
the thread accounting data structures. This causes threading issues where concurrently
finishing threads fail to pick up pending merges, causing potential thread starvation
on forceMerge calls.
Speed up geometry queries by specialising Component2D spatial operations. Instead of using a generic relate method for all relations, we use specialised methods for each one. In addition, the type of triangle is computed at deserialisation time, so we can be more selective when decoding the points of a triangle.
We already have IDs in SegmentInfo, as well as on SegmentInfos, which are useful to uniquely identify segments and entire commits. Having IDs on SegmentCommitInfo is useful too in
order to compare commits for equality and make snapshots incremental on generational files.
This change adds a unique ID to SegmentCommitInfo starting from Lucene 8.6. Older segments won't have an ID until the segment receives an update or a delete even if they have been opened and / or committed by Lucene 8.6 or above.
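To make the snapshot use case concrete, here is a hedged sketch of how per-commit IDs could drive an incremental copy. `ToyCommitInfo`, `newId` and `segmentsToCopy` are hypothetical, the 16-byte random ID is an assumption for illustration, and a missing ID (modelled as `null`) is conservatively treated as "must copy".

```java
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CommitIdSketch {
  private static final SecureRandom RANDOM = new SecureRandom();

  /** Hypothetical ID generator: 16 random bytes. */
  static byte[] newId() {
    byte[] id = new byte[16];
    RANDOM.nextBytes(id);
    return id;
  }

  /** Toy per-commit segment entry; id == null models a segment that has no ID yet. */
  static final class ToyCommitInfo {
    final String segmentName;
    final byte[] id;

    ToyCommitInfo(String segmentName, byte[] id) {
      this.segmentName = segmentName;
      this.id = id;
    }
  }

  /** Returns the segments whose generational files cannot be proven unchanged. */
  static List<String> segmentsToCopy(Map<String, byte[]> snapshotIds, List<ToyCommitInfo> current) {
    List<String> toCopy = new ArrayList<>();
    for (ToyCommitInfo info : current) {
      byte[] previous = snapshotIds.get(info.segmentName);
      boolean unchanged =
          info.id != null && previous != null && Arrays.equals(info.id, previous);
      if (!unchanged) {
        toCopy.add(info.segmentName);
      }
    }
    return toCopy;
  }

  static List<String> runDemo() {
    byte[] id0 = newId();
    Map<String, byte[]> snapshot = new HashMap<>();
    snapshot.put("_0", id0);
    snapshot.put("_1", newId());
    List<ToyCommitInfo> current = Arrays.asList(
        new ToyCommitInfo("_0", id0),     // untouched since the snapshot
        new ToyCommitInfo("_1", newId()), // received an update, so it carries a fresh ID
        new ToyCommitInfo("_2", null));   // older segment without an ID
    return segmentsToCopy(snapshot, current);
  }

  public static void main(String[] args) {
    System.out.println(runDemo());
  }
}
```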
This change moves the deletes tracked by FrozenBufferedUpdates that
are private to the DWPT and never used in a global context out of
FrozenBufferedUpdates.
After the recent refactoring in LUCENE-9304, `IW#getMaxCompletedSequenceNumber()` might
return values that belong to non-completed operations if a full flush is running, a new delete
queue is already in place, but not all DWPTs that participate in the full flush have finished their
in-flight operations. This caused rare failures in
`TestControlledRealTimeReopenThread#testControlledRealTimeReopenThread` where
documents are not actually visible given the max completed seqNo. This change streamlines
the delete queue advance, adds a dedicated test case and ensures that a delete queue's
sequence ID space is never exhausted.
This change removes the ThreadState indirection from DWPTPool and pools DWPTs directly. The tracking information and locking semantics are mostly moved to DWPT itself, and the pool semantics have changed slightly such that a DWPT needs to be checked out of the pool once it needs to be flushed or aborted. This automatically grows and shrinks the number of DWPTs in the system when the number of threads grows or shrinks. Access to pooled DWPTs is more straightforward and doesn't require an ordinal. Instead, consumers can just iterate over the elements in the pool.
This allowed for removal of indirections in DWPTFlushControl like BlockedFlush, the removal of DWPTPool setter and getter in IndexWriterConfig and the addition of stronger assertions in DWPT and DW.
This test failed on Elastic CI because we did not add any term in the
loop. This commit ensures that we always add at least one docId, term
and query in the test.
The SegmentMerger usage in IW#addIndexes(CodecReader...) might make changes
to the Directory while the IW tries to clean-up files on rollback. This
causes issues like FileNotFoundExceptions when IDF tries to remove temp files.
This change adds a waiting mechanism to the abortMerges method that, in addition
to the running merges, also waits for merges in addIndexes(CodecReader...).
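One way to picture such a waiting mechanism is a counter of in-flight merges guarded by the writer's monitor. The sketch below is an assumption-laden toy (`ToyWriter` and its method names are hypothetical), not the IndexWriter implementation: abort blocks until every registered merge has finished, and refuses new ones once it has started.

```java
public class AbortMergesSketch {
  /** Hypothetical writer that tracks merges started on behalf of addIndexes. */
  static final class ToyWriter {
    private int runningAddIndexesMerges; // guarded by this
    private boolean aborting;            // guarded by this

    /** Returns false if the writer is aborting and the merge must not start. */
    synchronized boolean startAddIndexesMerge() {
      if (aborting) {
        return false;
      }
      runningAddIndexesMerges++;
      return true;
    }

    synchronized void finishAddIndexesMerge() {
      runningAddIndexesMerges--;
      notifyAll(); // wake up a thread waiting in abortMerges
    }

    synchronized void abortMerges() throws InterruptedException {
      aborting = true;
      while (runningAddIndexesMerges > 0) {
        wait(); // releases the monitor until a merge finishes
      }
      // From here on no merge touches the directory, so file cleanup is safe.
    }

    synchronized int running() {
      return runningAddIndexesMerges;
    }
  }

  static boolean runDemo() {
    ToyWriter writer = new ToyWriter();
    if (!writer.startAddIndexesMerge()) {
      return false;
    }
    Thread merger = new Thread(() -> {
      try {
        Thread.sleep(50); // simulate merge work
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      writer.finishAddIndexesMerge();
    });
    merger.start();
    try {
      writer.abortMerges();
      boolean drained = writer.running() == 0;
      merger.join();
      // After abort, no merge is running and new merges are rejected.
      return drained && !writer.startAddIndexesMerge();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(runDemo());
  }
}
```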
Today a doc values update creates a new field infos file that contains the original field infos updated for the new generation as well as the new fields created by the doc values update.
However, existing fields are cloned through the global fields (shared in the index writer) instead of the local ones (present in the segment).
In practice this is not an issue since field numbers are shared between segments created by the same index writer.
But this assumption doesn't hold for segments created by different writers and added through IndexWriter#addIndexes(Directory).
In this case, the field number of the same field can differ between segments so any doc values update can corrupt the index
by assigning the wrong field number to an existing field in the next generation.
When this happens, queries and merges can access wrong fields without throwing any error, leading to a silent corruption in the index.
This change ensures that we preserve local field numbers when creating
a new field infos generation.
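The failure mode can be reproduced with plain maps. In this sketch, field infos are reduced to a name-to-number map and the method names are hypothetical; it only contrasts numbering existing fields through the writer-wide map with preserving the segment's own numbers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldNumberSketch {
  /** Problematic variant: existing fields get their numbers from the writer-wide map. */
  static Map<String, Integer> nextGenerationViaGlobal(
      Map<String, Integer> local, Map<String, Integer> global) {
    Map<String, Integer> next = new LinkedHashMap<>();
    for (String name : local.keySet()) {
      next.put(name, global.get(name));
    }
    return next;
  }

  /** Fixed variant: existing fields keep the numbers they already have in the segment. */
  static Map<String, Integer> nextGenerationViaLocal(Map<String, Integer> local) {
    return new LinkedHashMap<>(local);
  }

  /** Numbers as assigned by the writer that owns the index. */
  static Map<String, Integer> globalNumbers() {
    Map<String, Integer> global = new LinkedHashMap<>();
    global.put("title", 0);
    global.put("price", 1);
    return global;
  }

  /** A segment written by a different writer, where the same fields have other numbers. */
  static Map<String, Integer> importedSegmentNumbers() {
    Map<String, Integer> local = new LinkedHashMap<>();
    local.put("price", 0);
    local.put("title", 1);
    return local;
  }

  public static void main(String[] args) {
    Map<String, Integer> local = importedSegmentNumbers();
    // The global variant swaps the two numbers, so data written under number 0
    // would be read as "title" instead of "price".
    System.out.println(nextGenerationViaGlobal(local, globalNumbers()));
    System.out.println(nextGenerationViaLocal(local));
  }
}
```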
This commit introduces a mechanism to control allocation of threads to slices planned for a query.
The default implementation uses the size of the executor's backlog queue to determine whether a slice should be allocated a new thread.
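A queue-size based decision might look like the sketch below. The class name, the `LIMITING_FACTOR` value and the exact threshold are assumptions made for illustration and are not taken from this commit; only the `java.util.concurrent` calls are real APIs. When the backlog is too large, the slice runs on the calling thread instead of being handed to the pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SliceAllocationSketch {
  // Assumed tuning knob: tolerate a backlog of up to 1.5 tasks per pool thread.
  static final double LIMITING_FACTOR = 1.5;

  /** True if the executor's backlog is small enough to hand it another slice. */
  static boolean shouldUseNewThread(ThreadPoolExecutor executor) {
    return executor.getQueue().size() < executor.getMaximumPoolSize() * LIMITING_FACTOR;
  }

  /** Runs the slice on the pool if there is capacity, otherwise on the caller thread. */
  static void runSlice(ThreadPoolExecutor executor, Runnable slice) {
    if (shouldUseNewThread(executor)) {
      executor.execute(slice);
    } else {
      slice.run();
    }
  }

  static boolean runDemo() {
    ThreadPoolExecutor executor =
        new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    CountDownLatch release = new CountDownLatch(1);
    try {
      boolean idleAllowsThread = shouldUseNewThread(executor); // empty queue
      // Occupy the only worker so that further tasks pile up in the queue.
      executor.execute(() -> {
        try {
          release.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      executor.execute(() -> {});
      executor.execute(() -> {});
      // Two queued tasks against a threshold of 1 * 1.5: the backlog is too large.
      boolean backlogDeniesThread = !shouldUseNewThread(executor);
      return idleAllowsThread && backlogDeniesThread;
    } finally {
      release.countDown();
      executor.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(runDemo());
  }
}
```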