This change moves deletes that are private to the DWPT and never used in a
global context out of FrozenBufferedUpdates, which previously tracked them.
After the recent refactoring in LUCENE-9304, `IW#getMaxCompletedSequenceNumber()` might
return values that belong to non-completed operations if a full flush is running, a new delete
queue is already in place, but not all DWPTs that participate in the full flush have finished their
in-flight operations. This caused rare failures in
`TestControlledRealTimeReopenThread#testControlledRealTimeReopenThread` where
documents are not actually visible given the max completed seqNo. This change streamlines
the delete queue advance, adds a dedicated test case, and ensures that a delete queue's
sequence ID space is never exhausted.
This change removes the ThreadState indirection from DWPTPool and pools DWPTs directly. The tracking information and locking semantics are mostly moved to DWPT itself, and the pool semantics have changed slightly such that DWPTs need to be checked out of the pool once they need to be flushed or aborted. This automatically grows and shrinks the number of DWPTs in the system as the number of threads grows or shrinks. Access to pooled DWPTs is more straightforward and doesn't require an ordinal; instead, consumers can just iterate over the elements in the pool.
This allowed for the removal of indirections in DWPTFlushControl like BlockedFlush, the removal of the DWPTPool setter and getter in IndexWriterConfig, and the addition of stronger assertions in DWPT and DW.
This test failed on Elastic CI because we did not add any terms in the
loop. This commit ensures that we always add at least one docId, term,
and query in the test.
The SegmentMerger usage in IW#addIndexes(CodecReader...) might make changes
to the Directory while the IW tries to clean up files on rollback. This
causes issues like FileNotFoundExceptions when the IFD tries to remove temp files.
This change adds a waiting mechanism to the abortMerges method that, in addition
to the running merges, also waits for merges in addIndexes(CodecReader...).
Today a doc values update creates a new field infos file that contains the original field infos updated for the new generation as well as the new fields created by the doc values update.
However, existing fields are cloned through the global fields (shared in the index writer) instead of the local ones (present in the segment).
In practice this is not an issue since field numbers are shared between segments created by the same index writer.
But this assumption doesn't hold for segments created by different writers and added through IndexWriter#addIndexes(Directory).
In this case, the field number of the same field can differ between segments so any doc values update can corrupt the index
by assigning the wrong field number to an existing field in the next generation.
When this happens, queries and merges can access the wrong fields without throwing any error, leading to silent corruption in the index.
This change ensures that we preserve local field numbers when creating
a new field infos generation.
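As a rough illustration of the scenario described above (field names, directories, and values here are made up for the example; this demonstrates the problematic setup, not code from the change): two independent writers assign different field numbers to the same fields, the segments are combined via addIndexes(Directory), and a later doc values update must then rely on the segment-local numbering.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LocalFieldNumbersDemo {
  public static void main(String[] args) throws Exception {
    Directory dirA = new ByteBuffersDirectory();
    Directory dirB = new ByteBuffersDirectory();

    // Writer A sees "id" first, then "counter": they get field numbers 0 and 1.
    try (IndexWriter a = new IndexWriter(dirA, new IndexWriterConfig())) {
      Document doc = new Document();
      doc.add(new StringField("id", "1", Field.Store.NO));
      doc.add(new NumericDocValuesField("counter", 1));
      a.addDocument(doc);
    }

    // Writer B sees "counter" first, then "id": the numbers are swapped.
    try (IndexWriter b = new IndexWriter(dirB, new IndexWriterConfig())) {
      Document doc = new Document();
      doc.add(new NumericDocValuesField("counter", 2));
      doc.add(new StringField("id", "2", Field.Store.NO));
      b.addDocument(doc);
    }

    // A third writer imports both segments as-is, keeping their local field numbering.
    Directory combined = new ByteBuffersDirectory();
    try (IndexWriter w = new IndexWriter(combined, new IndexWriterConfig())) {
      w.addIndexes(dirA, dirB);
      // Before this fix, the field infos written for the new doc values generation could
      // pick up the writer-global number for "counter" instead of the segment-local one.
      w.updateNumericDocValue(new Term("id", "2"), "counter", 42L);
      w.commit();
    }
  }
}
```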
This commit introduces a mechanism to control the allocation of threads to slices planned for a query.
The default implementation uses the size of the executor's backlog queue to determine whether a slice should be allocated a new thread.
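A minimal sketch of the idea (the class and method names below are hypothetical, not the exact Lucene API): before handing a slice to the executor, inspect the executor's backlog and fall back to the calling thread when it is saturated.

```java
import java.util.concurrent.ThreadPoolExecutor;

/** Hypothetical helper illustrating queue-size-based slice dispatch. */
final class QueueSizeBasedSliceDispatcher {
  private final ThreadPoolExecutor executor;
  private final int maxBacklog;

  QueueSizeBasedSliceDispatcher(ThreadPoolExecutor executor, int maxBacklog) {
    this.executor = executor;
    this.maxBacklog = maxBacklog;
  }

  /** Hands the slice to a pool thread only if the backlog is below the limit. */
  void dispatch(Runnable sliceTask) {
    if (executor.getQueue().size() < maxBacklog) {
      executor.execute(sliceTask); // plenty of capacity: hand off to the pool
    } else {
      sliceTask.run();             // backlog too deep: run on the caller thread
    }
  }
}
```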
Today we have a large amount of duplicated code that is rather
complex. This change consolidates the code paths to always
use the updateDocuments path.
During rollback, IndexWriter must process all pending events before closing the writer. Otherwise, AlreadyClosedExceptions can be thrown during event processing, which can cause the writer to be closed with a tragic event.
Today we have duplicated logic on how to convert a seqNo into a real
seqNo and process events based on this. This change consolidates the logic
into a single method.
* LUCENE-8962: Simplify test case
The testMergeOnCommit test case was trying to verify too many things
at once: basic semantics of merge on commit and proper behavior when
a bunch of indexing threads are writing and committing all at once.
Now we just verify basic behavior, with strict assertions on invariants, while
leaving it to MockRandomMergePolicy to enable merge on commit in existing
test cases to verify that indexing generally works as expected and no new
unexpected exceptions are thrown.
* LUCENE-8962: Only update toCommit if merge was committed
The code was previously assuming that if mergeFinished() was called and
isAborted() was false, then the merge must have completed successfully.
Instead, we should know for sure if a given merge was committed, and
only then update our pending commit SegmentInfos.
This commit makes ValueSourceScorer's costing algorithm also take the delegated FunctionValues' cost into consideration when calculating its cost. FunctionValues now exposes a cost method which is used by ValueSourceScorer's default matchCost method. In addition, ValueSourceScorer exposes a matchCost method which can be overridden to specify a custom costing mechanism.
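A simplified sketch of the relationship described above (these are stand-in classes showing the cost delegation only, not the actual Lucene sources):

```java
// Stand-in for FunctionValues: exposes a per-document evaluation cost estimate.
abstract class FunctionValuesSketch {
  abstract float floatVal(int doc);

  /** New in this change: an estimate of how expensive it is to evaluate one document. */
  float cost() {
    return 100f; // arbitrary default; concrete implementations return something meaningful
  }
}

// Stand-in for ValueSourceScorer: its default match cost delegates to the wrapped values.
abstract class ValueSourceScorerSketch {
  private final FunctionValuesSketch values;

  ValueSourceScorerSketch(FunctionValuesSketch values) {
    this.values = values;
  }

  abstract boolean matches(int doc);

  /** Default match cost delegates to the wrapped FunctionValues; override for custom costing. */
  float matchCost() {
    return values.cost();
  }
}
```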
1. TestIndexWriterMergePolicy.testMergeOnCommit will fail if the last
commit (the one that should trigger the full merge) doesn't have any
pending changes (which could occur if the last indexing thread
commits at the end). We can fix that by adding one more document
before that commit.
2. The previous implementation was throwing IOException if the commit
thread gets interrupted while waiting for merges to complete. This
violates IndexWriter's documented behavior of throwing
ThreadInterruptedException.
* LUCENE-8962: Add ability to selectively merge on commit
This adds a new "findCommitMerges" method to MergePolicy, which can
specify merges to be executed before the
IndexWriter.prepareCommitInternal method returns.
If we have many index writer threads, they will flush their DWPT buffers
on commit, resulting in many small segments, which can be merged before
the commit returns.
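A rough sketch of a merge policy plugging into this hook, shown with the final name and signature described further below (after the rename to findFullFlushMerges and the addition of MergeTrigger); a real policy would be much more selective about which segments to combine:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.*;

/**
 * Illustrative only: merge every segment that is not already being merged into a single
 * merge before the commit returns.
 */
class MergeSmallSegmentsOnCommitPolicy extends FilterMergePolicy {

  MergeSmallSegmentsOnCommitPolicy(MergePolicy in) {
    super(in);
  }

  @Override
  public MergeSpecification findFullFlushMerges(
      MergeTrigger mergeTrigger, SegmentInfos segmentInfos, MergeContext mergeContext)
      throws IOException {
    List<SegmentCommitInfo> candidates = new ArrayList<>();
    for (SegmentCommitInfo sci : segmentInfos) {
      if (mergeContext.getMergingSegments().contains(sci) == false) {
        candidates.add(sci);
      }
    }
    if (candidates.size() < 2) {
      return null; // nothing worth merging before the commit returns
    }
    MergeSpecification spec = new MergeSpecification();
    spec.add(new OneMerge(candidates));
    return spec;
  }
}
```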
* Add missing Javadoc
* Fix incorrect comment
* Refactoring and fix intermittent test failure
1. Made some changes to the callback to update toCommit, leveraging
SegmentInfos.applyMergeChanges.
2. I realized that we'll never end up with 0 registered merges, because
we throw an exception if we fail to register a merge.
3. Moved the IndexWriterEvents.beginMergeOnCommit notification to before
we call MergeScheduler.merge, since we may not be merging on another
thread.
4. There was an intermittent test failure due to randomness in the time
it takes for merges to complete. Before doing the final commit, we wait
for pending merges to finish. We may still end up abandoning the final
merge, but we can detect that and assert that either the merge was
abandoned (and we have > 1 segment) or we did merge down to 1 segment.
* Fix typo
* Fix/improve comments based on PR feedback
* More comment improvements from PR feedback
* Rename method and add new MergeTrigger
1. Renamed findCommitMerges -> findFullFlushMerges.
2. Added MergeTrigger.COMMIT, passed to findFullFlushMerges and to
MergeScheduler when merging on commit.
* Update renamed method name in strings and comments
This adds a test to `BaseIndexFileFormatTestCase` verifying that the combination
of opening a reader and calling `checkIntegrity` on it reads all bytes
of all files (including index headers and footers). This would help
detect most cases when `checkIntegrity` is not implemented correctly.
QueryBuilder currently has special logic for graph phrase queries with no slop,
constructing a span query that attempts to follow all paths using a combination of
OR and NEAR queries. However, this type of query has known bugs (LUCENE-7398).
This commit removes this logic and just builds a disjunction of phrase queries, one
phrase per path.
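A rough sketch of the shape of the query that is now built (the path extraction is left as an input here; the real code walks the token graph to enumerate paths):

```java
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

final class GraphPhraseSketch {
  /** Builds OR(phrase(path1), phrase(path2), ...), one exact phrase per graph path. */
  static Query graphPhraseQuery(String field, List<List<String>> paths) {
    BooleanQuery.Builder disjunction = new BooleanQuery.Builder();
    for (List<String> path : paths) {
      PhraseQuery.Builder phrase = new PhraseQuery.Builder();
      for (String token : path) {
        phrase.add(new Term(field, token));
      }
      disjunction.add(phrase.build(), BooleanClause.Occur.SHOULD);
    }
    return disjunction.build();
  }
}
```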
SOLR-12238: Handle boosts in QueryBuilder
QueryBuilder now detects per-term boosts supplied by a BoostAttribute when
building queries using a TokenStream. This commit also adds a DelimitedBoostTokenFilter
that parses boosts from tokens using a delimiter token, and exposes this in Solr.
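A minimal sketch of the idea behind such a filter (a simplified stand-in, not the actual DelimitedBoostTokenFilter source): tokens like `lucene|2.0` are split on the delimiter and the numeric part is exposed through BoostAttribute, which QueryBuilder now honours.

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.search.BoostAttribute;

/** Simplified stand-in for a delimited-boost filter: "term|2.0" becomes "term" with boost 2.0. */
final class SimpleDelimitedBoostFilter extends TokenFilter {
  private final char delimiter;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final BoostAttribute boostAtt = addAttribute(BoostAttribute.class);

  SimpleDelimitedBoostFilter(TokenStream input, char delimiter) {
    super(input);
    this.delimiter = delimiter;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (input.incrementToken() == false) {
      return false;
    }
    boostAtt.setBoost(1f); // default when no delimiter is present
    final char[] buffer = termAtt.buffer();
    final int length = termAtt.length();
    for (int i = length - 1; i >= 0; i--) {
      if (buffer[i] == delimiter) {
        boostAtt.setBoost(Float.parseFloat(new String(buffer, i + 1, length - i - 1)));
        termAtt.setLength(i); // strip the delimiter and boost from the term text
        break;
      }
    }
    return true;
  }
}
```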
Die, python2, die.
Some generated .java files change (parameterized automata for
spell-correction).
This is because the order of python dictionaries was not well-defined
previously. A sort() was added so that the python code now generates
reproducible output (Thanks @mikemccand).
So we'll suffer a change once, but the automata are equivalent. If you
run the script again you should not see source code changes.
The relevant unit tests are exhaustive (if you trust the paper!), so we can
be confident it does not break things, even though it looks very scary.
With this change, we sort dvUpdates in term order before applying them if
they all update a single field to the same value. This optimization can
reduce the flush time by around 20% for docValues update use cases.
On newer linux distros, at least, 'python' now means python3. So
we can't rely on what version of python it will invoke (at least for a
few years).
For example in Fedora Linux:
https://fedoraproject.org/wiki/Changes/Python_means_Python3
For python2.x code, explicitly call 'python2.7' and for python3.x code,
explicitly call 'python3'.
Ant variable names are cleaned up, e.g. 'python.exe' is renamed to
'python2.exe' and 'python32.exe' is renamed to 'python3.exe'. This also
makes it easy to identify remaining python 2.x code that should be
migrated to python 3.x
Previous situation:
* The snowball base classes (Among, SnowballProgram, etc) had accumulated local performance-related changes. There was a task that would also "patch" generated classes (e.g. GermanStemmer) after-the-fact.
* Snowball classes had many "non-changes" from the original such as removal of tabs, addition of javadocs, license headers, etc.
* Snowball test data (inputs and expected stems) was incorporated into lucene testing, but this was maintained manually. Also files had become large, making the test too slow (Nightly).
* Snowball stopwords lists from their website were manually maintained. In some cases encoding fixes were manually applied.
* Some generated stemmers (such as Estonian and Armenian) exist in lucene, but have no corresponding `.sbl` file in snowball sources at all.
Besides this mess, the snowball project is "moving along" and acquiring new languages, adding non-BSD-licensed test data, huge test data, and other complexity. So it is time to automate the integration better.
New situation:
* Lucene has a `gradle snowball` regeneration task. It works on Linux or Mac only. It checks out their repos, applies the `snowball.patch` in our repository, compiles snowball stemmers, regenerates all java code, applies any adjustments so that our build is happy.
* Test data is automatically regenerated from the commit hash of the snowball test data repository. Not all languages are tested from their data: only where the license is simple BSD. Test data is also (deterministically) sampled, so that we don't have huge files. We just want to make sure our integration works.
* Randomized tests are still set to test every language with generated fake words. The regeneration task ensures all languages get tested (it writes a simple text file list of them).
* Stopword files are automatically regenerated from the commit hash of the snowball website repository.
* The regeneration procedure is idempotent. This way when stuff does change, you know exactly what happened. For example if test data changes to a different license, you may see a git deletion. Or if a new language/stopwords/test data gets added, you will see git additions.
Java 13 adds a new doclint check under "accessibility" that verifies the html
header nesting level isn't crazy.
Many existing headers are incorrect because the html4-style javadocs had horrible
font-sizes, so developers used the wrong header level to work around it.
This is no issue in trunk (always html5).
Java recommends against using such structured tags at all in javadocs,
but that is a more involved change: this just "shifts" header levels
in documents to be correct.
the "missing javadocs" checker needed tweaks to work with the format
changes of java 13.
As a followup we may investigate javadoc (maybe the new doclet api). It
has its own missing checks too now, but they are black vs white (either
fully documented or not checked), whereas this python tool allows us to
"improve", e.g. enforce that all classes have doc, even if all
methods do not yet.
Current javadocs declare an HTML5 doctype: !DOCTYPE HTML. Some HTML5
features are used, but unfortunately also some constructs that do not
exist in HTML5 are used as well.
Because of this, we have no checking of any html syntax. jtidy is
disabled because it works with html4. doclint is disabled because it
works with html5. Our docs are neither.
The javadoc "doclint" feature can efficiently check that the html isn't
crazy. We just have to fix really ancient removed/deprecated stuff
(such as use of the tt tag).
This enables the html checking in both ant and gradle. The docs are
fixed via straightforward transformations.
One exception is table cellpadding; for this, some helper CSS classes
were added to make the transition easier (since it must apply padding
to inner th/td, not possible inline). I added TODOs; we should clean
this up. Most problems look like they may have been generated from a
GUI or similar and not a human.
* LUCENE-9142: Refactor SortedIntSet for equality
Split SortedIntSet into a class hierarchy to make comparisons to
FrozenIntSet more meaningful. Use Arrays.equals for more efficient
comparison. Add tests for IntSet to verify correctness.
If you have repeating intervals in an ordered or unordered interval source, you currently
get somewhat confusing behaviour:
* `ORDERED(a, a, b)` will return an extra interval over just `a b` if it first matches `a a b`, meaning
that you can get incorrect results if used in a `CONTAINING` filter -
`CONTAINING(ORDERED(x, y), ORDERED(a, a, b))` will match on the document `a x a b y`
* `UNORDERED(a, a)` will match on documents that just contain a single `a`.
This commit adds a RepeatingIntervalsSource that correctly handles repeats within
ordered and unordered sources. It also changes the way that gaps are calculated within
ordered and unordered sources, by using a new width() method on IntervalIterator. The
default implementation just returns end() - start() + 1, but RepeatingIntervalsSource
instead returns the sum of the widths of its child iterators. This preserves maxgaps filtering
on ordered and unordered sources that contain repeats.
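A simplified stand-in for the new hook described above (not the exact Lucene source):

```java
/** Simplified stand-in for IntervalIterator, showing the width() addition only. */
abstract class IntervalIteratorSketch {
  abstract int start(); // start position of the current interval
  abstract int end();   // end position of the current interval

  /**
   * Default width: the number of positions covered by the current interval.
   * RepeatingIntervalsSource overrides this to return the sum of its children's
   * widths, which keeps maxgaps filtering meaningful when repeats are involved.
   */
  int width() {
    return end() - start() + 1;
  }
}
```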
In order to correctly handle matches in this scenario, IntervalsSource#matches now always
returns an explicit IntervalMatchesIterator rather than a plain MatchesIterator, which adds
gaps() and width() methods so that submatches can be combined in the same way that
subiterators are. Extra checks have been added to checkIntervals() to ensure that the same
intervals are returned by both iterator and matches, and a fix to
DisjunctionIntervalIterator#matches() is also included - DisjunctionIntervalIterator minimizes
its intervals, while MatchesUtils.disjunction does not, so there was a discrepancy between
the two methods.
IndexMergeTool previously had no options and always ran forceMerge(1) on
the resulting index. This can result in wasted work and confusing
performance (unbalancing the index).
Instead, the default is now to not do anything except run merges requested by the
merge policy.
This replaces the index of stored fields and term vectors with two
`DirectMonotonic` arrays. `DirectMonotonicWriter` requires knowing the number
of values to write up-front, so incoming doc IDs and file pointers are buffered
on disk using temporary files that never get fsynced, but have index headers
and footers to make sure any corruption in these files wouldn't propagate to the
index.
`DirectMonotonicReader` gets a specialized `binarySearch` implementation that
leverages the metadata in order to avoid going to the IndexInput as often as
possible. Actually in the common case, it would only go to a single
sub `DirectReader` which, combined with the size of blocks of 1k values, helps
bound the number of page faults to 2.
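A conceptual sketch of the resulting lookup structure, with plain long arrays standing in for the DirectMonotonic-encoded data (names are illustrative only): start doc IDs and file pointers form two parallel monotonic sequences, and a lookup binary-searches the doc IDs to find the block holding a target doc.

```java
import java.util.Arrays;

/** Illustrative only: two parallel monotonic arrays indexing blocks of documents. */
final class BlockIndexSketch {
  private final long[] startDocIds;  // first doc ID of each block, strictly increasing, starting at 0
  private final long[] filePointers; // file pointer of each block, increasing

  BlockIndexSketch(long[] startDocIds, long[] filePointers) {
    this.startDocIds = startDocIds;
    this.filePointers = filePointers;
  }

  /** Returns the file pointer of the block that contains {@code docId} (docId >= 0). */
  long filePointerFor(long docId) {
    int idx = Arrays.binarySearch(startDocIds, docId);
    if (idx < 0) {
      idx = -idx - 2; // insertion point minus one: the last block starting at or before docId
    }
    return filePointers[idx];
  }
}
```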
The entire precommit task will still fail with an unsupported java version
(subsequent checks do not support the newer javadocs format).
But this allows the ECJ linter to run, which checks for things such as
unused imports.
Fuzzy queries with an edit distance of 1 or 2 must visit all blocks whose prefix
length is 1 or 2. By not compressing those, we can trade very little space (a
couple MBs in the case of the wikibigall index) for better query efficiency.
Changes include:
- Removed LZ4 compression of suffix lengths which didn't save much space
anyway.
- For stats, LZ4 was only really used for run-length compression of terms whose
docFreq is 1. This has been replaced by explicit run-length compression.
- Since we only use LZ4 for suffix bytes if the compression ratio is < 75%, we
now only try LZ4 out if the average suffix length is greater than 6, in order
to reduce index-time overhead.
The issue is that MockDirectoryWrapper's disk full check is horribly
inefficient. On every writeByte/etc, it totally recomputes disk space
across all files. This means it calls listAll() on the underlying
Directory (which sorts all the underlying files), then sums up fileLength()
for each of those files.
This leads to many pathological cases in the disk full tests... but the
number of tests impacted by this is minimal, and the logic is scary.
WhitespaceTokenizer defaults to a maximum token length of 255, and WhitespaceAnalyzer
does not allow this to be changed. This commit adds an optional maxTokenLen parameter
to WhitespaceAnalyzer as well, and documents the existing token length restriction.
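A small sketch of the new option (assuming the added maxTokenLen parameter is exposed as a constructor argument, mirroring the existing WhitespaceTokenizer option):

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;

public class WhitespaceAnalyzerLengthDemo {
  public static void main(String[] args) {
    // Default: tokens are capped at 255 chars, as documented by this change.
    WhitespaceAnalyzer defaults = new WhitespaceAnalyzer();

    // Assumed new constructor taking the maximum token length in characters,
    // letting very long whitespace-separated tokens survive unsplit.
    WhitespaceAnalyzer longTokens = new WhitespaceAnalyzer(4096);

    defaults.close();
    longTokens.close();
  }
}
```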
Compress blocks of suffixes in order to make the terms dictionary more
space-efficient. Two compression algorithms are used depending on which one is
more space-efficient:
- LowercaseAsciiCompression, which applies when all bytes are in the
`[0x1F,0x3F)` or `[0x5F,0x7F)` ranges, which notably include all digits,
lowercase ASCII characters, '.', '-' and '_', and encodes 4 chars on 3 bytes.
It is very often applicable on analyzed content and decompresses very quickly
thanks to auto-vectorization support in the JVM.
- LZ4, when the compression ratio is less than 0.75.
I was a bit unhappy with the complexity of the high-compression LZ4 option, so
I simplified it in order to only keep the logic that detects duplicate strings.
The logic about what to do in case overlapping matches are found, which was
responsible for most of the complexity while only yielding tiny benefits, has
been removed.
Calming down individual test methods with double-digit execution times
after running tests many times.
There are a few more issues remaining, but this solves the majority of them.
This is consistently the slowest test for me in all of lucene core by
far. Takes around an entire minute. Mark it nightly: should catch any
issues with RAM estimation but keep local builds fast.
Adds some build parameters to tune how tests run. There is an example
shown by "gradle helpLocalSettings".
C2 is disabled by default in tests, as it is wasteful locally and slows down
test runs. You can override this by setting tests.jvmargs for gradle,
or args for ant.
Some crazy lucene stress tests may need to be toned down after the
change, as they may have been doing too many iterations by default...
but this is not a new problem.
FuzzyTermsEnum can now either take an array of compiled automata and
an AttributeSource, to be used across multiple segments (eg during
FuzzyQuery rewrite); or it can take a term, edit distance, prefix length and transpositions
boolean and build the automata itself if only being used once (eg for fuzzy
nearest neighbour calculations).
Rather than interact via attribute sources and specialized attributes, users of
FuzzyTermsEnum can get the boost and set minimum competitive boosts
directly on the enum.
* SOLR-13984: add (experimental, disabled by default) security manager support.
Users can set SOLR_SECURITY_MANAGER_ENABLED=true to enable the security manager at runtime.
The current policy file used by tests is moved to solr/server
Additional permissions are granted for the filesystem locations set by bin/solr, and networking everywhere is enabled.
This takes advantage of the fact that permission entries are ignored if properties are not defined:
https://docs.oracle.com/javase/7/docs/technotes/guides/security/PolicyFiles.html#PropertyExp
Jetty 9.4.16.v20190411 and up introduced separate
client and server SslContextFactory implementations.
This split requires the proper use of
SslContextFactory in client and server configs.
This fixes the following
* SSL with SOLR_SSL_NEED_CLIENT_AUTH not working since v8.2.0
* Http2SolrClient SSL not working in branch_8x
Signed-off-by: Kevin Risden <krisden@apache.org>
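A small sketch of what the split means for code wiring up Jetty (plain Jetty 9.4 APIs; the Solr-specific configuration is omitted and exact client construction may differ by Jetty minor version): the HTTP client side must use SslContextFactory.Client, while the server connector uses SslContextFactory.Server, which is where need-client-auth belongs.

```java
import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.util.ssl.SslContextFactory;

public class SslFactorySplitSketch {
  public static void main(String[] args) throws Exception {
    // Client side (e.g. what an Http2SolrClient-style client needs): the Client flavour.
    SslContextFactory.Client clientSsl = new SslContextFactory.Client();
    HttpClient httpClient = new HttpClient(clientSsl);
    httpClient.start();

    // Server side: the Server flavour, where SOLR_SSL_NEED_CLIENT_AUTH maps to needClientAuth.
    SslContextFactory.Server serverSsl = new SslContextFactory.Server();
    serverSsl.setNeedClientAuth(true);

    httpClient.stop();
  }
}
```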