SOLR-12823: remove clusterstate.json in Lucene/Solr 9.0 - fix TestZKPropertiesWriter
TestZKPropertiesWriter relied on legacy SolrCloud cluster features that have been removed.
The test now starts a MiniSolrCloudCluster (which implies setting up a config set and other test resources) and uses the core of a collection created on that cluster.
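A minimal sketch of that setup using the SolrCloudTestCase test APIs; the config set and collection names here are illustrative, not the test's actual values:

```java
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.cloud.SolrCloudTestCase;
import org.junit.BeforeClass;

public class TestZKPropertiesWriterSketch extends SolrCloudTestCase {
  @BeforeClass
  public static void setupCluster() throws Exception {
    // One-node MiniSolrCloudCluster with a test config set uploaded to ZK.
    configureCluster(1)
        .addConfig("conf", configset("cloud-minimal"))
        .configure();
    // The test then works against the core backing this collection.
    CollectionAdminRequest.createCollection("collection1", "conf", 1, 1)
        .process(cluster.getSolrClient());
  }
}
```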
Add support for BlockMax WAND via a minExactHits parameter. Hits will be counted accurately at least up to this value; above it, the count will be an approximation. In distributed search requests the count is per shard, so hits may be counted accurately up to numShards * minExactHits. The response will include the value numFoundExact, which can be true (the value in numFound is exact) or false (the value in numFound is an approximation).
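Under the hood this corresponds to Lucene's total-hit-count threshold; a hedged sketch of detecting whether the count is exact ('searcher' and 'query' are assumed to exist):

```java
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.TotalHits;

// Collect the top 10 hits, counting the total exactly only up to 100;
// past that threshold, BlockMax WAND may skip non-competitive blocks and
// the total becomes a lower bound.
TopScoreDocCollector collector = TopScoreDocCollector.create(10, 100);
searcher.search(query, collector);
TopDocs topDocs = collector.topDocs();

// The analogue of numFoundExact: EQUAL_TO means the count is exact,
// GREATER_THAN_OR_EQUAL_TO means it is an approximation.
boolean numFoundExact = topDocs.totalHits.relation == TotalHits.Relation.EQUAL_TO;
```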
* Remove SRL.listConfigDir (unused)
* Remove SRL.getDataDir
* Remove SRL.getCoreName
* Remove SRL.getCoreProperties
XmlConfigFile now needs the substitutableProperties passed in
IndexSchema now needs the substitutableProperties passed in
Remove redundant Properties from CoreContainer constructors
* Remove SRL.newAdminHandlerInstance (unused)
* Remove SRL.openSchema and openConfig
* Avoid SRL.getConfigDir
Also harmonized similar initialization logic between DIH Tika processor & ExtractingRequestHandler.
* Ensure SRL.addToClassLoader and reloadLuceneSPI are called at most once
Don't auto-load "lib" in constructor; wrong place for this logic.
* Avoid SRL.getInstancePath
Added SolrCore.getInstancePath instead
Use CoreContainer.getSolrHome instead (see the sketch after this list)
NodeConfig should track solrHome separate from SolrResourceLoader
* Simplify some SolrCore constructors
* Move locateSolrHome to new SolrPaths
* Move "User Files" stuff to SolrPaths
Previous situation:
* The snowball base classes (Among, SnowballProgram, etc) had accumulated local performance-related changes. There was a task that would also "patch" generated classes (e.g. GermanStemmer) after-the-fact.
* Snowball classes had many "non-changes" from the original, such as removal of tabs, addition of javadocs, license headers, etc.
* Snowball test data (inputs and expected stems) was incorporated into lucene testing, but it was maintained manually. The files had also become large, making the test too slow (Nightly).
* Snowball stopword lists from their website were manually maintained. In some cases encoding fixes were manually applied.
* Some generated stemmers (such as Estonian and Armenian) exist in lucene, but have no corresponding `.sbl` file in the snowball sources at all.
Besides this mess, the snowball project is moving along: acquiring new languages, adding non-BSD-licensed test data, huge test data, and other complexity. So it is time to automate the integration better.
New situation:
* Lucene has a `gradle snowball` regeneration task. It works on Linux or Mac only. It checks out their repos, applies the `snowball.patch` in our repository, compiles snowball stemmers, regenerates all java code, applies any adjustments so that our build is happy.
* Test data is automatically regenerated from the commit hash of the snowball test data repository. Not all languages are tested from their data: only those where the license is simple BSD. Test data is also (deterministically) sampled so that we don't have huge files; we just want to make sure our integration works.
* Randomized tests are still set to test every language with generated fake words. The regeneration task ensures all languages get tested (it writes a simple text file list of them).
* Stopword files are automatically regenerated from the commit hash of the snowball website repository.
* The regeneration procedure is idempotent. This way when stuff does change, you know exactly what happened. For example if test data changes to a different license, you may see a git deletion. Or if a new language/stopwords/test data gets added, you will see git additions.
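For reference, the regenerated classes keep the same tiny snowball API; a minimal sketch of driving one directly:

```java
import org.tartarus.snowball.ext.EnglishStemmer;

// Generated stemmers expose setCurrent/stem/getCurrent from the snowball
// base class; analyzers wrap this in a TokenFilter.
EnglishStemmer stemmer = new EnglishStemmer();
stemmer.setCurrent("running");
stemmer.stem();
String stem = stemmer.getCurrent(); // "run"
```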
Java 13 adds a new doclint check under "accessibility" verifying that the html
header nesting level isn't crazy.
Many headers are incorrect because the html4-style javadocs had horrible
font sizes, so developers used the wrong header level to work around it.
This is no issue in trunk (always html5).
Java recommends against using such structured tags at all in javadocs,
but that is a more involved change: this just "shifts" header levels
in documents to be correct.
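An illustrative before/after on a hypothetical class, showing the kind of shift applied:

```java
// Before: the header level was chosen for its html4 font size, breaking the
// page outline; Java 13 doclint flags this under "accessibility".
//
//   /** Widget utilities. <h1>Expert usage</h1> */
//
// After: the header is simply shifted to the level matching the outline.

/** Widget utilities. <h2>Expert usage</h2> */
public final class WidgetUtils {
  private WidgetUtils() {}
}
```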
Current javadocs declare an HTML5 doctype: !DOCTYPE HTML. Some HTML5
features are used, but unfortunately some constructs that do not exist
in HTML5 are used as well.
Because of this, we have no checking of any html syntax: jtidy is
disabled because it expects html4, doclint is disabled because it
expects html5, and our docs are neither.
The javadoc "doclint" feature can efficiently check that the html isn't
crazy; we just have to fix really ancient removed/deprecated constructs
(such as use of the tt tag).
This enables the html checking in both ant and gradle. The docs are
fixed via straightforward transformations.
One exception is table cellpadding; for this, some helper CSS classes
were added to make the transition easier (since the padding must apply
to inner th/td, which is not possible inline). I added TODOs; we should
clean this up. Most problems look like they were generated from a GUI
or similar rather than written by a human.
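For example, the tt tag (removed in HTML5) gets a straightforward replacement (illustrative doc comment on a stub method):

```java
// Before: <tt> does not exist in HTML5, so doclint rejects it.
//   /** Returns the <tt>maxDoc</tt> for this reader. */

// After: replaced with a tag that exists in HTML5.
/** Returns the <code>maxDoc</code> for this reader. */
public int maxDoc() { return 0; } // stub carrying the doc comment
```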
* DocValuesFieldExistsQuery and NormsFieldExistsQuery are used for existence queries when possible.
* Added documentation on the difference between field:* and field:[* TO *]
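A minimal sketch of the two query classes (field names assumed):

```java
import org.apache.lucene.search.DocValuesFieldExistsQuery;
import org.apache.lucene.search.NormsFieldExistsQuery;
import org.apache.lucene.search.Query;

// Matches documents that have a doc value for "price"; cheaper than the
// wildcard/range rewrites of field:* and field:[* TO *].
Query existsByDocValues = new DocValuesFieldExistsQuery("price");

// For indexed text fields that keep norms, the norms-based variant applies.
Query existsByNorms = new NormsFieldExistsQuery("title");
```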
Solr tests now have a policy similar to Lucene's: loopback use only. If a
test tries to resolve or connect to the internet, it will get a SecurityException.
Some Solr tests explicitly try to talk to dead nodes with real
networking. This is not good and is asking for trouble, but these tests now
use low loopback port numbers instead of multicast addresses; the idea is
that connections fail faster. Move these to constants so that the values
aren't copy-pasted everywhere, in case we have to do something different later.
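A hedged sketch of such constants (names and port numbers here are assumptions, not the actual values):

```java
// Low loopback ports where nothing should be listening: a connection
// attempt gets refused quickly instead of timing out on a multicast
// address. Hypothetical names/values for illustration.
public static final String DEAD_HOST_1 = "127.0.0.1:4";
public static final String DEAD_HOST_2 = "127.0.0.1:6";
```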
The Solr permissions are weak sauce due to the huge number of features, third-party dependencies, etc., so the code has access to do many things.
For "scripting" such as velocity we have to take a more aggressive stance:
Step 1: Can we wrap a sandbox around the whole thing and call it a day?
Step 2: Separate the "engine" from "untrusted code" and lock down only the latter.
Step 3: Java's security model is weak; contain that classloader and whitelist access.
On Windows, these objects can't be inspected due to security restrictions, so the test runner fails the tests since it does not know how big the leak is.
1) use per-method state isolation in several tests...
This helps prevent state persisted by one test method from leaking into other test methods,
and ensures that these tests play nicely with -Dtests.iters > 1
2) TestRerankBase cleanup to eliminate unnecessary extra SolrCore (that was being leaked)
This specific commit affects all points in the codebase where the argument of a StringBuilder.append() call is itself a regular String concatenation.
This defeats the purpose of using StringBuilder and also introduces an extra allocation.
These changes avoid that.
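An illustrative instance of the pattern ('name' and 'id' are assumed variables):

```java
StringBuilder sb = new StringBuilder();

// Before: the argument is itself a concatenation, so the compiler builds a
// temporary buffer plus an intermediate String just to pass it to append().
sb.append("name=" + name + ", id=" + id);

// After: chained appends write directly into the existing builder.
sb.append("name=").append(name).append(", id=").append(id);
```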
Ant tests have been run and succeeded on a local machine.
Removing test files from the changes.
Another suggested rework.