This specific commit affects all points in the codebase where the argument of a StringBuilder.append() call is itself a regular String concatenation.
This defeats the purpose of using StringBuilder and also introduces an extra allocation.
These changes should avoid that.
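As a minimal sketch of the pattern described above (the class and method names are made up for illustration and do not come from the patch), the first method concatenates inside append() and the second chains the appends instead:

```java
public class AppendExample {
    // Before: the argument is concatenated into a temporary String,
    // which is then copied into the StringBuilder.
    public static String withConcatenation(String name, int count) {
        StringBuilder sb = new StringBuilder();
        sb.append("name=" + name + ", count=" + count);
        return sb.toString();
    }

    // After: each piece is appended directly, so no intermediate String is built.
    public static String withChainedAppends(String name, int count) {
        StringBuilder sb = new StringBuilder();
        sb.append("name=").append(name).append(", count=").append(count);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(withConcatenation("doc", 3));
        System.out.println(withChainedAppends("doc", 3));
    }
}
```

Both methods produce the same text; only the number of temporary objects differs.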
The ant tests ran and succeeded on a local machine.
Removing test files from the changes.
Another suggested rework.
The current slicing algorithm assigns a thread per segment, which
can be detrimental to performance when the distribution has
a large number of small segments. This patch introduces a slicing
algorithm that coalesces smaller segments into a single slice
served by one thread, reducing the impact of context switching
by limiting the number of threads.
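The coalescing idea can be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: segments are modelled as plain document counts, and the two thresholds (`maxDocsPerSlice`, `maxSegmentsPerSlice`) are hypothetical parameters chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class SliceSketch {
    // Groups segment doc counts into slices; one thread would serve each slice.
    // Both thresholds are assumptions made for this sketch.
    public static List<List<Integer>> slices(List<Integer> segmentDocCounts,
                                             int maxDocsPerSlice,
                                             int maxSegmentsPerSlice) {
        List<Integer> sorted = new ArrayList<>(segmentDocCounts);
        sorted.sort(Comparator.reverseOrder()); // largest segments first

        List<List<Integer>> result = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long docsInCurrent = 0;
        for (Integer docCount : sorted) {
            if (docCount > maxDocsPerSlice) {
                // A large segment keeps a slice (and thread) of its own.
                result.add(Collections.singletonList(docCount));
                continue;
            }
            current.add(docCount);
            docsInCurrent += docCount;
            // Close the slice once it holds enough docs or enough segments.
            if (docsInCurrent > maxDocsPerSlice || current.size() >= maxSegmentsPerSlice) {
                result.add(current);
                current = new ArrayList<>();
                docsInCurrent = 0;
            }
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(slices(Arrays.asList(1000, 10, 20, 30, 5), 100, 3));
    }
}
```

With five segments as input, the sketch yields three slices rather than five threads.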
Signed-off-by: Adrien Grand <jpountz@gmail.com>
Reader attributes allow a per-IndexReader configuration of codec internals.
For instance, this allows configuring per reader whether FSTs are loaded into memory or are left
on disk.
In order to prevent ConcurrentModificationException, this change makes
an unmodifiable copy on write for all maps in SegmentInfo. MergePolicies
can access these maps without synchronization, which causes exceptions if
the maps are modified in the merge thread.
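A minimal sketch of the copy-on-write approach, assuming a single attribute map; the class below is a hypothetical stand-in and not SegmentInfo itself:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class CopyOnWriteMapSketch {
    // Readers always see an immutable snapshot; volatile publishes new snapshots safely.
    private volatile Map<String, String> attributes = Collections.emptyMap();

    // Safe to call and iterate without synchronization.
    public Map<String, String> getAttributes() {
        return attributes;
    }

    // Writers copy the current map, modify the copy, and swap in an unmodifiable view.
    public synchronized String putAttribute(String key, String value) {
        Map<String, String> copy = new HashMap<>(attributes);
        String previous = copy.put(key, value);
        attributes = Collections.unmodifiableMap(copy);
        return previous;
    }

    public static void main(String[] args) {
        CopyOnWriteMapSketch info = new CopyOnWriteMapSketch();
        info.putAttribute("mode", "a");
        Map<String, String> snapshot = info.getAttributes();
        info.putAttribute("other", "b");
        // The earlier snapshot is unaffected by the later write.
        System.out.println(snapshot + " vs " + info.getAttributes());
    }
}
```

A thread iterating over an earlier snapshot never observes a later write, so it cannot hit a ConcurrentModificationException.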
Before we can expose options to configure this postings format
on a per-reader basis, we need to expose the option to load the terms
index FST off or on heap on the postings format itself. This already allows
an expert user to change the default in a per-field postings format.
This essentially provides the ability to change
defaults globally, albeit with some glue code.
FilterDirectory.getPendingDeletions() did not delegate the call, which
resulted in a new IndexWriter on the same directory not considering files
pending deletion. This could in turn result in a FileAlreadyExistsException
when running on Windows.
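The delegation fix can be illustrated with a toy model. All types below are hypothetical stand-ins rather than the real Directory classes, and the sketch assumes the base class returns an empty set by default, which is what a wrapper silently inherits when it forgets to delegate:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class DelegationSketch {
    // Hypothetical minimal directory abstraction (assumption: empty default).
    public static class SimpleDirectory {
        public Set<String> getPendingDeletions() {
            return Collections.emptySet();
        }
    }

    // A directory that actually tracks files it could not delete yet.
    public static class TrackingDirectory extends SimpleDirectory {
        private final Set<String> pending = new HashSet<>();

        public void markPending(String file) {
            pending.add(file);
        }

        @Override
        public Set<String> getPendingDeletions() {
            return Collections.unmodifiableSet(pending);
        }
    }

    // The wrapper must forward the call; otherwise callers see the empty default.
    public static class FilterSimpleDirectory extends SimpleDirectory {
        protected final SimpleDirectory in;

        public FilterSimpleDirectory(SimpleDirectory in) {
            this.in = in;
        }

        @Override
        public Set<String> getPendingDeletions() {
            return in.getPendingDeletions();
        }
    }

    public static void main(String[] args) {
        TrackingDirectory dir = new TrackingDirectory();
        dir.markPending("_0.si");
        System.out.println(new FilterSimpleDirectory(dir).getPendingDeletions());
    }
}
```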
Today we never load FSTs of ID-like fields off-heap since we need
very fast access for updates. Yet, a reader that is not loaded from
an IndexWriter can also leave the FST on disk. This change adds
this information to SegmentReadState to allow the postings format
to make this decision without configuration.
This commit adds an introspection API to Query, allowing users to traverse
the nested structure of a query and examine its leaves. It replaces the existing
`extractTerms` method on Weight, and alters some highlighting code to use
the new API.
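To illustrate the kind of traversal such an introspection API enables, here is a small visitor sketch over a toy query tree. The `Node`, `TermNode`, `BooleanNode` and `LeafVisitor` types are invented for this example and are not the API added by the commit:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class QueryVisitSketch {
    // Hypothetical stand-ins for a nested query structure.
    public interface LeafVisitor {
        void visitLeaf(String field, String term);
    }

    public interface Node {
        void visit(LeafVisitor visitor);
    }

    // A leaf: reports itself to the visitor.
    public static final class TermNode implements Node {
        private final String field;
        private final String term;

        public TermNode(String field, String term) {
            this.field = field;
            this.term = term;
        }

        @Override
        public void visit(LeafVisitor visitor) {
            visitor.visitLeaf(field, term);
        }
    }

    // A composite: forwards the visitor to each nested clause.
    public static final class BooleanNode implements Node {
        private final List<Node> clauses;

        public BooleanNode(Node... clauses) {
            this.clauses = Arrays.asList(clauses);
        }

        @Override
        public void visit(LeafVisitor visitor) {
            for (Node clause : clauses) {
                clause.visit(visitor);
            }
        }
    }

    // Collects all leaf terms, similar in spirit to what extractTerms offered.
    public static List<String> collectTerms(Node root) {
        List<String> terms = new ArrayList<>();
        root.visit((field, term) -> terms.add(field + ":" + term));
        return terms;
    }

    public static void main(String[] args) {
        Node query = new BooleanNode(
            new TermNode("title", "lucene"),
            new BooleanNode(new TermNode("body", "index")));
        System.out.println(collectTerms(query));
    }
}
```

Because the traversal lives on the query structure itself, callers such as highlighters can examine the leaves without first building a Weight.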
This is a spin-off from `LUCENE-8700`, which is satisfied by IndexWriter#flushNextBuffer.
The idea here is to additionally call flushNextBuffer in RandomIndexWriter for better
test coverage. This is a test-only change.
This can cause spurious failures when run in conjunction with HandleLimitFS,
as we can end up with lots of very small segments, which trips the file handle
limit.
Today we have #numDocs() and #maxDoc() on IndexWriter. This is enough
to get all stats for the current index, but the two calls are subject to
concurrent changes and might return numbers that are not consistent, i.e. in
some cases maxDoc < numDocs, which is undesirable. This change adds a getDocStats()
method to IndexWriter to allow fetching consistent numbers for these stats.
This change also deprecates IndexWriter#numDocs() and IndexWriter#maxDoc()
and replaces all their usages with IndexWriter#getDocStats().
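The consistency problem and its fix can be sketched with a toy writer. This illustrates the snapshot idea only: `ToyWriter` and its counters are assumptions made for the example, and only the notion of a single stats object carrying both numbers comes from the description above.

```java
public class DocStatsSketch {
    // Immutable snapshot: both numbers are captured at the same instant.
    public static final class DocStats {
        public final int maxDoc;
        public final int numDocs;

        DocStats(int maxDoc, int numDocs) {
            this.maxDoc = maxDoc;
            this.numDocs = numDocs;
        }
    }

    // Hypothetical writer that tracks added and deleted documents.
    public static final class ToyWriter {
        private int maxDoc;
        private int deleted;

        public synchronized void addDocument() {
            maxDoc++;
        }

        public synchronized void deleteDocument() {
            if (deleted < maxDoc) {
                deleted++;
            }
        }

        // Reading both values under one lock guarantees numDocs <= maxDoc,
        // which two separate getter calls could not promise under concurrency.
        public synchronized DocStats getDocStats() {
            return new DocStats(maxDoc, maxDoc - deleted);
        }
    }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter();
        writer.addDocument();
        writer.addDocument();
        writer.deleteDocument();
        DocStats stats = writer.getDocStats();
        System.out.println("maxDoc=" + stats.maxDoc + " numDocs=" + stats.numDocs);
    }
}
```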
MemoryIndex: compute/cache up-front
Solr Collapse/Expand with top_fc: compute/cache up-front
Json Facets numerics / hash DV: use the cached fieldInfos on SolrIndexSearcher
SolrIndexSearcher: move the cached FieldInfos to SlowCompositeReaderWrapper
Many tests were written before we introduced IndexSearcher#count and used
`searcher.search(query, 1).totalHits` to get the number of matches of a query
rather than `searcher.count(query)`.
This change adds the number of documents that are soft deleted but
not hard deleted to the segment commit info. This is the last step
towards making soft deletes as powerful as hard deletes, since now the
number of documents can be read from commit points without opening a
full-blown reader. This also allows merge policies to make decisions
without requiring an NRT reader to get the relevant statistics. This
change doesn't enforce any field to be used as soft deletes, and the statistic
is maintained per segment.
The soft deletes field must be marked as such once it's introduced
and can't be changed after the fact.
Co-authored-by: Nhat Nguyen <nhat.nguyen@elastic.co>
This change introduces a new MergePolicy.MergeContext interface
that is easy to mock and cuts over all instances of IW to MergeContext.
Since IW now implements MergeContext, the cut-over is straightforward.
This reduces the exposed API available in MP dramatically and allows
efficient testing without relying on IW to improve the coverage and
testability of our MP implementations.
When renaming a file, `FSDirectory#rename` tries to delete the dest file
if it's in the pending deletes list. If that delete fails, it adds the
dest to the pending deletes list again. This causes the dest file to be
deleted later by `deletePendingFiles`.
Today we fail to create the IndexWriter when the directory has a
pending delete. Yet, this is mainly done to prevent writing files that
still exist more than once. IndexFileDeleter already accounts for
that for existing files, and we can now use it to also take pending
deletes into account, which ensures that all file generations per segment
always go forward.
Today, once a document has a value in a certain DV field, this value
can only be changed but not removed. While resetting / removing a value
from a field is certainly a corner case, it can be used to undelete a
soft-deleted document unless it's merged away.
This allows rolling back changes without rolling back to another commit point
or trashing all uncommitted changes. In certain scenarios it can be used to
"repair" the history of documents in distributed systems.
The particular test here is #testStressLocks, which has several protections against
WindowsFS and special logic in the catch clause that steps out on fatal exceptions with
pending deletes. Since we now check this consistently in the IW ctor, we need to also
skip this entire test if we are on Windows and have pending deletes.
IndexWriter today is shared with many classes like BufferedUpdateStream,
DocumentsWriter and DocumentsWriterPerThread. Some of them even acquire locks
on the writer instance or assert that the current thread doesn't hold a lock.
This makes it very difficult to have a manageable threading model.
This change separates out the IndexWriter from those classes and makes them all
independent of IW. IW now implements a new interface for DocumentsWriter to communicate
on failed or successful flushes and tragic events. This allows IW to make its critical
methods private and execute all lock-critical actions on its private queue, which ensures
that the IW lock is not held. Follow-up changes will try to detach more code like
publishing flushed segments to ensure we never call back into IW in an uncontrolled way.
Inside the IndexWriter, buffers are only written to disk if it's needed
or "worth it", which doesn't guarantee that soft deletes are accounted for
in time. This is not necessarily a problem since they are eventually
collected and segments that have soft deletes will be merged eventually,
but for tests and for parity with hard deletes this behavior
is tricky.
This change cuts over to accounting in-place just like hard-deletes. This
results in accurate delete numbers for soft deletes at any given point in time
once the reader is loaded or a pending soft delete occurs.
This change also fixes an issue where all updates to a DV field are allowed
even if the field is unknown. Now this only works if the field is equal
to the soft deletes field. This behavior was never released.
This adds support for soft deletes if the reader is opened from a directory.
Today we only support soft deletes for NRT readers. This change allows wrapping an
existing DirectoryReader with a SoftDeletesDirectoryReaderWrapper to also filter
out soft deletes in the case of a non-NRT reader.
This change adds support for soft deletes as a fully supported feature
by the index writer. Soft deletes are accounted for inside the index
writer and therefore also by merge policies.
This change also adds a SoftDeletesRetentionMergePolicy that allows
users to selectively carry over soft-deleted documents across merges
for retention policies. The merge policy selects documents that should
be kept around in the merged segment based on a user provided query.
Several places in the index package don't handle exceptions well or ignore them.
This change adds some utility methods and cuts over to try-with-resources blocks
to simplify exception handling.
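As a small illustration of why try-with-resources simplifies this kind of code, the sketch below (with a made-up `Resource` class that only logs its close call) shows that resources are closed in reverse order even when the body throws:

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class TryWithResourcesSketch {
    // Hypothetical resource that records when it is closed.
    static final class Resource implements Closeable {
        private final String name;
        private final List<String> log;

        Resource(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    // Returns the sequence of events so the ordering can be inspected.
    public static List<String> run() {
        List<String> log = new ArrayList<>();
        try (Resource in = new Resource("in", log);
             Resource out = new Resource("out", log)) {
            log.add("work");
            throw new IOException("simulated failure");
        } catch (IOException e) {
            // Both resources are already closed when the catch block runs.
            log.add("caught " + e.getMessage());
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

No explicit finally block or success flag is needed, which is the boilerplate that manual cleanup code tends to get wrong.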
Index/Update Threads try to help out flushing pending document buffers to
disk. This change adds an expert setting to opt out of this behavior unless
flushing is falling behind.
Adds a `flushNextBuffer` method to IndexWriter that allows the caller to
synchronously move the next pending or the biggest non-pending index buffer to
disk. This enables flushing selected buffers to disk without hijacking an
indexing thread. This is for instance useful if more than one IW (shards) must
be maintained in a single JVM / system.