We need to disable merges while we wait for running merges, since
IW calls a timed wait on its lock. The wait releases the monitor for its
duration, which allows new merges to be registered unless we disable them.
Ensure we only roll back IW once
Today we might roll back IW more than once if we hit an exception in
the rollback code during shutdown. This change moves the rollback code outside
the try block to ensure we always roll back but never roll back twice.
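
The control-flow change can be illustrated with a minimal sketch. `FakeWriter` and `ShutdownSketch` are hypothetical stand-ins, not Lucene classes, and the "before" shape is an assumption about what the old code roughly looked like rather than a copy of it.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for IndexWriter that counts rollback attempts. */
final class FakeWriter {
  final AtomicInteger rollbacks = new AtomicInteger();
  private final boolean failOnRollback;

  FakeWriter(boolean failOnRollback) {
    this.failOnRollback = failOnRollback;
  }

  void flush() {
    // Pretend to flush pending state; never fails in this sketch.
  }

  void rollback() {
    rollbacks.incrementAndGet();
    if (failOnRollback) {
      throw new IllegalStateException("rollback failed");
    }
  }
}

final class ShutdownSketch {
  /** Assumed "before" shape: rollback inside the try, retried by the handler. */
  static void shutdownBefore(FakeWriter writer) {
    try {
      writer.flush();
      writer.rollback();
    } catch (RuntimeException e) {
      writer.rollback(); // second attempt if the first rollback threw
    }
  }

  /** "After" shape: rollback sits outside the try and runs exactly once. */
  static void shutdownAfter(FakeWriter writer) {
    try {
      writer.flush();
    } catch (RuntimeException e) {
      // Swallowed for brevity; real code would record or log the failure.
    }
    writer.rollback();
  }
}
```

With a writer whose rollback throws, `shutdownBefore` attempts the rollback twice, while `shutdownAfter` attempts it once and lets the exception propagate.
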
to check if a file already exists instead of opening an IndexInput
on the file, which might throw an AccessDeniedException in some Directory implementations.
The DWPTPool should not release new DWPTs after it's closed. Yet, if the pool
is in a state where it's preventing new writers from being created in order to swap
the delete queue, it might get closed. In that case we fail to throw an AlreadyClosedException
and release a new writer, which violates the condition that the pool is empty after it's closed
and all remaining DWPTs have been aborted.
Previously, we only removed 'match all' FILTER clauses if there was at least one
MUST clause. Now they're also removed if there is another distinct FILTER clause.
This lets boolean queries like `#field:value #*:*` be rewritten to `#field:value`.
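
As a rough illustration of the rule, the sketch below models clauses as plain strings using the same prefix notation as the example (`#` for FILTER, `+` for MUST). `FilterRewriteSketch` is a hypothetical toy, not the BooleanQuery rewrite code, and it ignores details such as clause deduplication.

```java
import java.util.ArrayList;
import java.util.List;

final class FilterRewriteSketch {
  /** String form of a 'match all' FILTER clause in this toy notation. */
  static final String MATCH_ALL_FILTER = "#*:*";

  /**
   * Drops 'match all' FILTER clauses when the query also has a MUST clause
   * or another, distinct FILTER clause; otherwise returns the clauses as-is.
   */
  static List<String> rewrite(List<String> clauses) {
    boolean hasOtherRequiredClause = false;
    for (String clause : clauses) {
      boolean must = clause.startsWith("+");
      boolean distinctFilter = clause.startsWith("#") && !clause.equals(MATCH_ALL_FILTER);
      if (must || distinctFilter) {
        hasOtherRequiredClause = true;
        break;
      }
    }
    if (!hasOtherRequiredClause) {
      return new ArrayList<>(clauses);
    }
    List<String> rewritten = new ArrayList<>();
    for (String clause : clauses) {
      if (!clause.equals(MATCH_ALL_FILTER)) {
        rewritten.add(clause);
      }
    }
    return rewritten;
  }
}
```

A lone `#*:*` clause is left untouched here, since removing it would leave the query with no required clause at all.
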
Given the input text 'A B A C', an ordered interval 'A B C' will currently return an incorrect
interval [2, 3] in addition to the correct [0, 3] interval. This is due to a bug in the ORDERED
algorithm, where we assume that after the first interval is returned, the sub-intervals are
always in-order. This assumption only holds during minimization, as minimizing an interval
may move the earlier terms beyond the trailing terms.
For example, after the initial [0, 3] interval is found above, the algorithm will attempt to
minimize it by advancing A to [2,2]. Because this is still before C at [3,3], but after B at
[1,1], we then try advancing B, leaving it at [Inf,Inf]. Minimization has failed, so we return
the original interval of [0,3]. However, when we come to retrieve the next interval, our
sub-intervals look like this: A[2,2], B[Inf,Inf], C[3,3]. The assumption that they are in order
is broken. The algorithm sees that A is before B, assumes that therefore all subsequent
sub-intervals are in order, and returns the new interval.
This commit fixes things by changing the assumption of ordering to only hold during
minimization. When first finding a candidate interval, the algorithm now checks that
all sub-intervals appear in order.
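
The difference between the two checks can be shown on the state from the example. This is a simplified sketch, not the ORDERED iterator itself: `OrderCheckSketch`, its method names, and the `{start, end}` array encoding are assumptions made for illustration, with `INF` standing in for an exhausted sub-interval.

```java
final class OrderCheckSketch {
  /** Stand-in for an exhausted sub-interval, written as Inf in the text. */
  static final int INF = Integer.MAX_VALUE;

  /** The faulty shortcut: only compares the first two sub-intervals. */
  static boolean firstPairInOrder(int[][] subs) {
    return subs.length < 2 || subs[0][1] < subs[1][0];
  }

  /** The stricter check: every sub-interval must end before the next one starts. */
  static boolean allInOrder(int[][] subs) {
    for (int i = 1; i < subs.length; i++) {
      if (subs[i - 1][1] >= subs[i][0]) {
        return false;
      }
    }
    return true;
  }
}
```

For the state `A[2,2], B[Inf,Inf], C[3,3]`, encoded as `{{2, 2}, {INF, INF}, {3, 3}}`, `firstPairInOrder` accepts the candidate because A ends before B starts, while `allInOrder` rejects it because B does not end before C starts.
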
Add an IndexWriter merge-on-commit feature that selectively merges small segments on commit,
subject to a configurable timeout. This improves search performance by reducing the number of
small segments that must be searched.
Co-authored-by: Michael Froh <msfroh@apache.org>
Co-authored-by: Michael Sokolov <sokolov@falutin.net>
Co-authored-by: Mike McCandless <mikemccand@apache.org>
In the case of an exception, it's possible that some OneMerge instances
will be closed multiple times. This commit ensures that mergeFinished is
called exactly once.
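
One common way to get exactly-once behaviour is a compare-and-set guard. The sketch below shows that technique only; `OneMergeSketch` is a hypothetical class, and the actual change may guard the call differently.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a merge that may be closed more than once. */
final class OneMergeSketch {
  private final AtomicBoolean finished = new AtomicBoolean();
  /** Counts how often the completion hook actually ran (for illustration). */
  final AtomicInteger mergeFinishedCalls = new AtomicInteger();

  /** Safe to call repeatedly; only the first call runs the completion hook. */
  void close() {
    if (finished.compareAndSet(false, true)) {
      mergeFinished();
    }
  }

  private void mergeFinished() {
    mergeFinishedCalls.incrementAndGet();
  }
}
```

Because `compareAndSet` succeeds for only one caller, repeated or concurrent `close()` calls after an exception cannot trigger the hook a second time.
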
Similar to how scorers can update their iterators to skip non-competitive
documents, collectors and comparators should also provide and update
iterators that allow them to skip non-competitive documents.
DWPT.DocState had some historical value, but with today's somewhat
cleaned-up DWPT and IndexingChain there is little to no value in having
this class. It also requires explicit cleanup, which is no longer
necessary.
This change adds infrastructure to allow straightforward waiting
on one or more merges or an entire merge specification. This is
a basis for LUCENE-8962.
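
Waiting on a group of merges with a timeout could look roughly like the following. This is a sketch under the assumption that each merge exposes its completion as a `CompletableFuture`; `MergeWaitSketch` and its `await` method are hypothetical and do not reproduce the actual infrastructure.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class MergeWaitSketch {
  /**
   * Waits until every merge future completes or the timeout elapses.
   * Returns true only if all merges completed normally in time.
   */
  static boolean await(List<CompletableFuture<Void>> merges, long timeout, TimeUnit unit) {
    CompletableFuture<Void> all =
        CompletableFuture.allOf(merges.toArray(new CompletableFuture<?>[0]));
    try {
      all.get(timeout, unit);
      return true;
    } catch (TimeoutException | ExecutionException e) {
      return false; // timed out, or at least one merge failed
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for callers
      return false;
    }
  }
}
```

The same method covers a single merge (a one-element list) and a whole specification (all of its merges), which is the kind of uniform waiting the change describes.
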
Several classes within the IndexWriter indexing chain haven't been touched for
several years. Most of these classes expose their internals through public
members and are difficult to construct in tests since they depend on many other
classes. This change tries to clean up TermsHashPerField and adds a dedicated
standalone test for it to make it more accessible for other developers since
it's simpler to understand. There are also attempts to make documentation better
as a result of this refactoring.
This commit adds a new class IndexSorter which handles how a sort should be applied
to documents in an index:
* how to serialize/deserialize sort info in the segment header
* how to sort documents within a segment
* how to sort documents from merging segments
SortField has a getIndexSorter() method, which will return null if the sort cannot be used
to sort an index (e.g., if it uses scores or other query-dependent values). This also requires a
new Codec as there is a change to the SegmentInfoFormat.
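
To make "sorting documents within a segment" concrete, the sketch below computes a doc ID permutation from one numeric sort value per document. It is a toy under stated assumptions: `DocSortSketch` and `newToOld` are hypothetical names and do not reflect the IndexSorter API.

```java
import java.util.Arrays;

final class DocSortSketch {
  /**
   * Returns a new-to-old doc ID map: entry i is the old doc ID of the
   * document that ends up at position i after sorting by sortValues.
   */
  static int[] newToOld(long[] sortValues) {
    Integer[] docs = new Integer[sortValues.length];
    for (int i = 0; i < docs.length; i++) {
      docs[i] = i;
    }
    // Sorting an object array is stable, so ties keep their original doc order.
    Arrays.sort(docs, (Integer a, Integer b) -> Long.compare(sortValues[a], sortValues[b]));
    int[] result = new int[docs.length];
    for (int i = 0; i < docs.length; i++) {
      result[i] = docs[i];
    }
    return result;
  }
}
```

For sort values `{30, 10, 20}` the map is `{1, 2, 0}`: old doc 1 becomes the first document, followed by old docs 2 and 0.
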