The only behaviour that QueueSizeBasedExecutor overrides from SliceExecutor is when to execute on the caller thread. There is no need to override the whole invokeAll method for that. Instead, this commit introduces a shouldExecuteOnCallerThread method that can be overridden.
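A minimal sketch of the resulting shape, using hypothetical stand-in classes rather than the real Lucene ones. The method signature, the queue-size supplier and the limit are assumptions made for illustration; only the idea of a single overridable hook comes from this commit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;

// Hypothetical stand-in for the base executor: invokeAll is written once here.
class SketchSliceExecutor {
  // Runs every task and records where it would run. A real executor would
  // hand "pool" tasks to a thread pool instead of running them inline.
  final List<String> invokeAll(List<Runnable> tasks) {
    List<String> placement = new ArrayList<>();
    for (int i = 0; i < tasks.size(); i++) {
      placement.add(shouldExecuteOnCallerThread(i, tasks.size()) ? "caller" : "pool");
      tasks.get(i).run();
    }
    return placement;
  }

  // Assumed default policy: only the last task runs on the caller thread.
  boolean shouldExecuteOnCallerThread(int index, int numTasks) {
    return index == numTasks - 1;
  }
}

// Hypothetical subclass: overrides only the decision, not invokeAll.
class SketchQueueSizeBasedExecutor extends SketchSliceExecutor {
  private final IntSupplier queueSize; // assumed source of the current queue size
  private final int maxQueueSize;      // assumed limit

  SketchQueueSizeBasedExecutor(IntSupplier queueSize, int maxQueueSize) {
    this.queueSize = queueSize;
    this.maxQueueSize = maxQueueSize;
  }

  @Override
  boolean shouldExecuteOnCallerThread(int index, int numTasks) {
    // Also fall back to the caller thread when the queue is too full.
    return super.shouldExecuteOnCallerThread(index, numTasks)
        || queueSize.getAsInt() >= maxQueueSize;
  }
}
```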
QueryTimeout was introduced together with ExitableDirectoryReader but is
now also optionally set on the IndexSearcher to wrap the bulk scorer
with a TimeLimitingBulkScorer. Its javadocs need updating.
There are a couple of places in the Exitable wrapper classes where
queryTimeout is set within the constructor and never modified. This
commit makes such members final.
The term member of TermAndBoost used to be a Term instance and became a
BytesRef with #11941, which means its equals impl no longer takes the field
name into account. The SynonymQuery equals impl needs to be updated
to compare the field as well; otherwise synonym queries with the same
terms and boosts on different fields are considered equal, which
is a bug.
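A minimal sketch of the problem and the fix, using simplified hypothetical classes (`SketchTermAndBoost`, `SketchSynonymQuery`) that are not the actual Lucene implementations. The term is modelled as a raw byte array, so the field has to be compared by the enclosing query.

```java
import java.util.Arrays;

// Hypothetical stand-in: the term carries bytes only, no field name.
final class SketchTermAndBoost {
  final byte[] term;
  final float boost;

  SketchTermAndBoost(byte[] term, float boost) {
    this.term = term;
    this.boost = boost;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof SketchTermAndBoost)) return false;
    SketchTermAndBoost other = (SketchTermAndBoost) o;
    return Arrays.equals(term, other.term)
        && Float.floatToIntBits(boost) == Float.floatToIntBits(other.boost);
  }

  @Override
  public int hashCode() {
    return 31 * Arrays.hashCode(term) + Float.hashCode(boost);
  }
}

// Hypothetical stand-in for the query: equality must include the field.
final class SketchSynonymQuery {
  final String field;
  final SketchTermAndBoost[] terms;

  SketchSynonymQuery(String field, SketchTermAndBoost[] terms) {
    this.field = field;
    this.terms = terms;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof SketchSynonymQuery)) return false;
    SketchSynonymQuery other = (SketchSynonymQuery) o;
    // Without the field comparison, queries on different fields would be equal.
    return field.equals(other.field) && Arrays.equals(terms, other.terms);
  }

  @Override
  public int hashCode() {
    return 31 * field.hashCode() + Arrays.hashCode(terms);
  }
}
```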
Today files are written to a compound file in no specific order.
The current order is determined by iterating over the set of file names in
SegmentInfo, whose iteration order is undefined. This commit changes to an
order based on file size. Colocating data from files that are smaller
(typically metadata files like terms index, field infos, etc.) but accessed
often can help when parts of these files are held in cache.
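A minimal sketch of size-based ordering, assuming the file sizes are already known as a name-to-length map. The class name and the tie-break on file name are assumptions added to make the order deterministic; they are not taken from this commit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class CompoundFileOrderSketch {
  // Returns file names ordered from smallest to largest file.
  // Ties are broken by name (an assumption) so the order is fully defined.
  static List<String> orderBySize(Map<String, Long> fileSizes) {
    List<String> names = new ArrayList<>(fileSizes.keySet());
    names.sort((a, b) -> {
      int bySize = Long.compare(fileSizes.get(a), fileSizes.get(b));
      return bySize != 0 ? bySize : a.compareTo(b);
    });
    return names;
  }
}
```

With this ordering, small files end up next to each other at the start of the compound file regardless of how the input set happens to iterate.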
QueryProfilerWeight should override matches and delegate to the
subQueryWeight. Another way to fix this issue is to make it extend
ProfileWeight and override only methods that need to have a different
behaviour than delegating to the sub weight.
* Hunspell: reduce suggestion set dependency on the hash table order
When adding words to a dictionary, suggestions for other words shouldn't change unless they're directly related to the added words.
Previously, GeneratingSuggester selected the 100 best first matches from the hash table, whose order can change significantly after adding any unrelated word.
That resulted in unexpected suggestion changes on seemingly unrelated dictionary edits.
- Ant is no longer used as the build system for Lucene
- JUnit is not packaged in a Lucene release
- The Float16Converter was removed before the PR it was used in was merged:
https://github.com/apache/lucene-solr/pull/2108
If we only have custom-case uART and capitalized UART, we shouldn't accept StandUart as a compound (although we keep hidden "Uart" dictionary entries for internal purposes).
After upgrading Elasticsearch to a recent Lucene snapshot, we observed a few
indexing slowdowns when indexing with low numbers of cores. This appears to be
due to the fact that we lost too much of the bias towards larger DWPTs in
apache/lucene#12199. This change tries to add back more ordering by adjusting
the concurrency of `DWPTPool` to the number of cores that are available on the
local node.
Given an input text 'A B A C A B C' and search ORDERED(A, B, C), we should
retrieve hits [0,3] and [4,6]; currently [4,6] is skipped.
After finding the first interval [0,3], the subintervals are A[0,0], B[1,1],
C[3,3]. The algorithm then tries to minimize the interval, and the subintervals
become A[2,2], B[5,5], C[3,3] (minimization stops once it finds 5 > 3).
When finding the next interval, it advances B before checking whether
B is after A (the do-while loop), so the subintervals become A[2,2], B[inf,inf],
C[3,3] and it returns NO_MORE_INTERVAL.
This commit instead continues advancing subintervals from where the last
`nextInterval` call stopped, rather than always advancing all subintervals.
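To make the expected hits concrete, here is a brute-force reference that computes minimal ordered intervals over a token array. It is not the iterator-based algorithm this commit changes; the class and method names are hypothetical, and it is only meant as a way to check expected results on small inputs.

```java
import java.util.ArrayList;
import java.util.List;

final class OrderedIntervalsSketch {
  // Returns the minimal intervals {start, end} in which the terms occur in order.
  static List<int[]> minimalOrderedIntervals(String[] tokens, String... terms) {
    // One candidate per occurrence of the first term: greedily take the
    // earliest occurrence of each following term.
    List<int[]> candidates = new ArrayList<>();
    for (int start = 0; start < tokens.length; start++) {
      if (!tokens[start].equals(terms[0])) continue;
      int pos = start;
      boolean matched = true;
      for (int t = 1; t < terms.length && matched; t++) {
        pos = indexOf(tokens, terms[t], pos + 1);
        matched = pos >= 0;
      }
      if (matched) candidates.add(new int[] {start, pos});
    }
    // Starts strictly increase and ends never decrease, so a candidate
    // contains a smaller match exactly when the next candidate has the same end.
    List<int[]> minimal = new ArrayList<>();
    for (int i = 0; i < candidates.size(); i++) {
      boolean containsNext =
          i + 1 < candidates.size() && candidates.get(i + 1)[1] == candidates.get(i)[1];
      if (!containsNext) minimal.add(candidates.get(i));
    }
    return minimal;
  }

  // Index of the first occurrence of term at or after from, or -1 if none.
  private static int indexOf(String[] tokens, String term, int from) {
    for (int i = from; i < tokens.length; i++) {
      if (tokens[i].equals(term)) return i;
    }
    return -1;
  }
}
```

For the tokens of 'A B A C A B C' and the terms A, B, C, the candidates are [0,3], [2,6] and [4,6]; [2,6] is dropped because it contains [4,6], leaving [0,3] and [4,6].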
Currently we're only half reusing postings enums when flushing sorted indexes
as we still create new wrapper instances every time, which can be costly with
fields that have many terms.
Obtaining a DWPT and putting it back into the pool is subject to contention.
This change reduces contention by using 8 sub pools that are tried sequentially.
When applied on top of #12198, this reduces the time to index geonames with 20
threads from ~19s to ~16-17s.
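A rough sketch of the sub-pool idea, assuming each sub pool is a small queue guarded by its own lock. The class name, the thread-derived start index and the blocking fallback are assumptions for illustration and do not mirror the actual pool implementation.

```java
import java.util.ArrayDeque;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

final class StripedPoolSketch<T> {
  private final ArrayDeque<T>[] queues;
  private final ReentrantLock[] locks;

  @SuppressWarnings("unchecked")
  StripedPoolSketch(int concurrency) {
    queues = new ArrayDeque[concurrency];
    locks = new ReentrantLock[concurrency];
    for (int i = 0; i < concurrency; i++) {
      queues[i] = new ArrayDeque<>();
      locks[i] = new ReentrantLock();
    }
  }

  // Assumed heuristic: each thread starts at a sub pool derived from its identity.
  private int startIndex() {
    return Math.floorMod(Thread.currentThread().hashCode(), locks.length);
  }

  // Puts an item back, preferring a sub pool whose lock is currently free.
  void release(T item) {
    int start = startIndex();
    for (int i = 0; i < locks.length; i++) {
      int idx = (start + i) % locks.length;
      if (locks[idx].tryLock()) {
        try {
          queues[idx].push(item);
          return;
        } finally {
          locks[idx].unlock();
        }
      }
    }
    // Every sub pool was busy: wait on the preferred one.
    locks[start].lock();
    try {
      queues[start].push(item);
    } finally {
      locks[start].unlock();
    }
  }

  // Tries the sub pools sequentially; creates a new item if all are empty.
  T acquire(Supplier<T> factory) {
    int start = startIndex();
    for (int i = 0; i < locks.length; i++) {
      int idx = (start + i) % locks.length;
      locks[idx].lock();
      try {
        T item = queues[idx].poll();
        if (item != null) return item;
      } finally {
        locks[idx].unlock();
      }
    }
    return factory.get();
  }
}
```

Threads that start at different sub pools rarely compete for the same lock, which is the effect the change relies on.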
This switches to LSBRadixSorter instead of TimSorter to sort postings whose
index options are `DOCS`. On a synthetic benchmark this yielded barely any
difference in the case when the index order is the same as the sort order, or
reverse, but almost a 3x speedup for writing postings in the case when the
index order is mostly random.
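For readers unfamiliar with the technique, a least-significant-byte radix sort can be sketched as follows. This is a generic textbook version over non-negative ints (such as doc IDs), written for illustration; it is not the code of Lucene's `LSBRadixSorter`.

```java
final class LsbRadixSortSketch {
  // Sorts non-negative ints in place using four stable counting passes,
  // one per byte, starting from the least significant byte.
  static void sort(int[] values) {
    int[] src = values;
    int[] dst = new int[values.length];
    for (int shift = 0; shift < 32; shift += 8) {
      // Histogram of the current byte, offset by one for the prefix sum.
      int[] counts = new int[257];
      for (int v : src) {
        counts[((v >>> shift) & 0xFF) + 1]++;
      }
      // Prefix sums: counts[b] becomes the first output slot for byte value b.
      for (int i = 0; i < 256; i++) {
        counts[i + 1] += counts[i];
      }
      // Stable scatter into the other buffer.
      for (int v : src) {
        dst[counts[(v >>> shift) & 0xFF]++] = v;
      }
      int[] tmp = src;
      src = dst;
      dst = tmp;
    }
    // After an even number of passes the sorted data is back in values.
  }
}
```

The running time does not depend on the initial order of the input, which fits the observation that the gain shows up mainly when the index order is mostly random.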
lucene-util's `IndexGeoNames` benchmark is heavily contended when running with
many indexing threads, 20 in my case. The main offender is
`DocumentsWriterFlushControl#doAfterDocument`, which runs after every index
operation to update doc and RAM accounting.
This change reduces contention by only updating RAM accounting if the amount of
RAM consumption that has not been committed yet by a single DWPT is at least
0.1% of the total RAM buffer size. This effectively batches updates to RAM
accounting, similarly to what happens when using `IndexWriter#addDocuments` to
index multiple documents at once. Since updates to RAM accounting may be
batched, `FlushPolicy` can no longer distinguish between inserts, updates and
deletes, so all 3 methods got merged into a single one.
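The batching idea can be sketched as follows, with hypothetical class and method names. Each writer accumulates its RAM usage locally and only touches the shared counter once the uncommitted amount reaches 0.1% of the buffer size; the real accounting and flush logic are more involved.

```java
import java.util.concurrent.atomic.AtomicLong;

final class BatchedRamAccountingSketch {
  private final AtomicLong committedBytes = new AtomicLong(); // shared, contended state
  private final long thresholdBytes;

  BatchedRamAccountingSketch(long ramBufferBytes) {
    this.thresholdBytes = ramBufferBytes / 1000; // 0.1% of the total RAM buffer
  }

  // Per-thread writer state; only this writer's thread touches uncommittedBytes.
  final class Writer {
    private long uncommittedBytes;

    void afterDocument(long bytesUsed) {
      uncommittedBytes += bytesUsed;
      if (uncommittedBytes >= thresholdBytes) {
        // One shared update covers many documents.
        committedBytes.addAndGet(uncommittedBytes);
        uncommittedBytes = 0;
      }
    }
  }

  Writer newWriter() {
    return new Writer();
  }

  long committedBytes() {
    return committedBytes.get();
  }
}
```

Because a single shared update may now cover several operations of different kinds, the shared side can no longer tell inserts, updates and deletes apart.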
With this change, `IndexGeoNames` goes from ~22s to ~19s and the main offender
for contention is now `DocumentsWriterPerThreadPool#getAndLock`.
Co-authored-by: Simon Willnauer <simonw@apache.org>