SOLR-12238: Handle boosts in QueryBuilder
QueryBuilder now detects per-term boosts supplied by a BoostAttribute when
building queries using a TokenStream. This commit also adds a DelimitedBoostTokenFilter
that parses boosts from tokens using a delimiter character, and exposes this in Solr.
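A minimal sketch of the new behavior, assuming a '|' delimiter and a whitespace tokenizer (the filter and attribute classes are Lucene's; the input string is illustrative):

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.boost.DelimitedBoostTokenFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.search.BoostAttribute;

    class BoostFilterSketch {
      public static void main(String[] args) throws Exception {
        Tokenizer tok = new WhitespaceTokenizer();
        tok.setReader(new StringReader("solr|0.5 lucene"));
        TokenStream ts = new DelimitedBoostTokenFilter(tok, '|');
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        BoostAttribute boost = ts.addAttribute(BoostAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          // prints: "solr boost=0.5", then "lucene boost=1.0" (the default)
          System.out.println(term + " boost=" + boost.getBoost());
        }
        ts.end();
        ts.close();
      }
    }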
Previous situation:
* The snowball base classes (Among, SnowballProgram, etc) had accumulated local performance-related changes. There was a task that would also "patch" generated classes (e.g. GermanStemmer) after-the-fact.
* Snowball classes had many "non-changes" from the original, such as removal of tabs, addition of javadocs, license headers, etc.
* Snowball test data (inputs and expected stems) was incorporated into lucene testing, but it was maintained manually. Also, the files had become large, making the test too slow (Nightly).
* Snowball stopwords lists from their website were manually maintained. In some cases encoding fixes were manually applied.
* Some generated stemmers (such as Estonian and Armenian) exist in lucene, but have no corresponding `.sbl` file in snowball sources at all.
Besides this mess, the snowball project is "moving along" and acquiring new languages, adding non-BSD-licensed test data, huge test data, and other complexity. So it is time to automate the integration better.
New situation:
* Lucene has a `gradle snowball` regeneration task. It works on Linux or Mac only. It checks out their repos, applies the `snowball.patch` in our repository, compiles snowball stemmers, regenerates all java code, applies any adjustments so that our build is happy.
* Test data is automatically regenerated from the commit hash of the snowball test data repository. Not all languages are tested from their data: only where the license is simple BSD. Test data is also (deterministically) sampled so that we don't have huge files (see the sketch after this list). We just want to make sure our integration works.
* Randomized tests are still set to test every language with generated fake words. The regeneration task ensures all languages get tested (it writes a simple text file list of them).
* Stopword files are automatically regenerated from the commit hash of the snowball website repository.
* The regeneration procedure is idempotent. This way when stuff does change, you know exactly what happened. For example if test data changes to a different license, you may see a git deletion. Or if a new language/stopwords/test data gets added, you will see git additions.
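A purely illustrative sketch (not the actual task code) of what deterministic sampling means here: a fixed hash of each input line decides membership, so every regeneration selects the same subset and the diffs stay stable.

    class SampleFilterSketch {
      // The rate and hash choice are assumptions, for illustration only.
      static boolean keepLine(String line) {
        final int sampleOneIn = 20;
        return Math.floorMod(line.hashCode(), sampleOneIn) == 0;
      }
    }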
Previous changes to this issue 'fixed' the way the test was creating mock Replica instances,
to ensure all properties were specified -- but those changes tickled a bug in the existing test
scaffolding that caused its "expectations" to be based on a regex check against only the base "url",
even though the test logic itself looked at the entire "core url".
The result is that there were reproducible failures if/when the randomly generated regex matched
".*1.*", because the existing test logic did not expect that to match the url of a Replica with
a core name of "core1", since it only considered the base url.
SOLR-13996: Refactor HttpShardHandler.prepDistributed method into smaller pieces
This commit introduces an interface named ReplicaSource which is marked as experimental. It has two sub-classes named CloudReplicaSource (for solr cloud) and LegacyReplicaSource (for non-cloud clusters). The prepDistributed method now calls out to these sub-classes depending on whether the cluster is running in cloud mode or not.
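A hypothetical sketch of the shape such an abstraction can take (the method names below are illustrative, not necessarily the actual SOLR-13996 API):

    import java.util.List;

    // CloudReplicaSource would resolve slices/replicas from cluster state;
    // LegacyReplicaSource from the shards parameter.
    interface ReplicaSource {
      int getSliceCount();
      List<String> getSliceNames();
      List<String> getReplicasBySlice(int sliceNumber);
    }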
* No Introduction (to Solr) header. Point at solr-upgrade-notes.adoc instead
* No Getting Started header
* No Versions of Major Components header
* No "Upgrade Notes" for subsequent releases. See solr-upgrade-notes.adoc
Closes #1202
Java 13 adds a new doclint check under "accessibility" that the html
header nesting level isn't crazy.
Many existing headers are incorrect because the html4-style javadocs had
horrible font-sizes, so developers used the wrong header level to work around it.
This is no issue in trunk (always html5).
Java recommends against using such structured tags at all in javadocs,
but that is a more involved change: this just "shifts" header levels
in documents to be correct.
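A toy illustration of the mechanical shift involved (the commit's edits were made directly to the files; this code is not from it, and it ignores headers with attributes):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class HeaderShift {
      // Shift <h1>..<h5> and their closing tags down one level, to <h2>..<h6>.
      static String shiftHeaders(String html) {
        Matcher m = Pattern.compile("(</?h)([1-5])>").matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
          m.appendReplacement(sb, m.group(1) + (Integer.parseInt(m.group(2)) + 1) + ">");
        }
        return m.appendTail(sb).toString();
      }
    }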
Current javadocs declare an HTML5 doctype: !DOCTYPE HTML. Some HTML5
features are used, but unfortunately some constructs that do not exist
in HTML5 are used as well.
Because of this, we have no checking of any html syntax: jtidy is
disabled because it only works with html4, doclint is disabled because
it only works with html5, and our docs are neither.
The javadoc "doclint" feature can efficiently check that the html isn't
crazy. We just have to fix really ancient removed/deprecated stuff
(such as use of the tt tag).
This enables the html checking in both ant and gradle. The docs are
fixed via straightforward transformations.
One exception is table cellpadding: for this, some helper CSS classes
were added to make the transition easier (since the padding must apply
to the inner th/td, which is not possible inline). I added TODOs; we
should clean this up. Most problems look like they were generated by a
GUI or similar tool rather than written by a human.
SOLR-14095 introduced an issue for rolling restarts (incompatible Java serialization). This change fixes the compatibility issue while keeping the functionality of SOLR-14095.
This triggers various places in the Streaming Expressions code that use background threads
to confirm that the expected credentials (or lack thereof) are propagated along.
Test currently has comments + workarounds for 2 known client issues:
- SOLR-14226: SolrStream reports AuthN/AuthZ failures (401|403) as IOException w/o details
- SOLR-14222: CloudSolrClient converts (update) 403 error to 500 error
This also fixes a bug where an inability to assign a node based on existing autoscaling policy resulted in a server error instead of a bad request.
This closes #1152.
* DocValuesFieldExistsQuery and NormsFieldExistsQuery are used for existence queries when possible (see the sketch below).
* Added documentation on the difference between field:* and field:[* TO *]
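A minimal sketch of the two Lucene queries an existence query can now rewrite to ("price" as a docValues field and "title" as an indexed text field with norms are illustrative choices):

    import org.apache.lucene.search.DocValuesFieldExistsQuery;
    import org.apache.lucene.search.NormsFieldExistsQuery;
    import org.apache.lucene.search.Query;

    class ExistenceQueries {
      Query priceExists = new DocValuesFieldExistsQuery("price"); // field:* on a docValues field
      Query titleExists = new NormsFieldExistsQuery("title");     // field:* on an indexed text field with norms
    }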
* Use Caffeine impl and weak values (weak references to the schema). Previously the cache never evicted! (sketched after this list)
* now populating the configSet name from ZK into CloudDescriptor when CloudDescriptor is loaded
* actual schema name needs to be deterministic now; fallback from non-existent managed-schema to schema.xml will thwart this cache
* a test conf/core.properties wasn't actually used and became a problem in its weird location after I refactored some logic
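A minimal sketch of the new cache shape (Caffeine's real builder API; the key format and the IndexSchema value type here are illustrative):

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;
    import org.apache.solr.schema.IndexSchema;

    class SchemaCache {
      // weakValues() lets a schema be GC'd once no core references it;
      // the previous cache never evicted anything.
      final Cache<String, IndexSchema> cache = Caffeine.newBuilder()
          .weakValues()
          .build();
    }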
Prior to this commit, Solr's Jetty listened for connections on all
network interfaces. This commit changes it to only listen on localhost,
to prevent incautious administrators from accidentally exposing their
Solr deployment to the world.
Administrators who wish to override this behavior can set the
SOLR_JETTY_HOST property in their Solr include file
(solr.in.sh/solr.in.cmd) to "0.0.0.0" or some other value.
A version of this commit was previously reverted due to inconsistency
between SOLR_HOST and SOLR_JETTY_HOST. This commit fixes this issue.
{!terms} queries have a docValues-based implementation that uses per-segment DV structures. This does well with a small to moderate number of query terms (a few hundred), but doesn't scale well beyond that due to the repetitive seeks done on each segment.
This commit introduces an implementation that uses a "top-level" docValues structure, which scales much better to very large {!terms} queries (many hundreds, thousands of terms).
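An illustrative sketch (not the actual patch) of why a top-level view helps: each term costs a single ordinal lookup against a merged docValues view, instead of one seek per term per segment:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiDocValues;
    import org.apache.lucene.index.SortedSetDocValues;
    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.LongBitSet;

    class TopLevelTermsSketch {
      static LongBitSet matchingOrds(IndexReader reader, String field, BytesRef[] terms) throws IOException {
        SortedSetDocValues topLevel = MultiDocValues.getSortedSetValues(reader, field);
        if (topLevel == null) return new LongBitSet(0); // field has no docValues
        LongBitSet ords = new LongBitSet(topLevel.getValueCount());
        for (BytesRef term : terms) {
          long ord = topLevel.lookupTerm(term); // one seek per term, total
          if (ord >= 0) ords.set(ord);
        }
        return ords;
      }
    }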
* force a hard commit of all docs in TestCloudConsistency to work around bug in that test
* add new AwaitsFix'ed TestTlogReplayVsRecovery that more explicitly demonstrates the bug via TestInjection.updateLogReplayRandomPause
Currently the documentation pretends to create a JKS keystore. It is
only actually a JKS keystore on java 8: on java9+ it is a PKCS12
keystore with a .jks extension (because PKCS12 is the new java default).
It works even though solr explicitly tells the JDK
(SOLR_SSL_KEY_STORE_TYPE=JKS) that it's a JKS keystore when it is in
fact not, due to how keystore backwards compatibility was implemented.
Fix docs to explicitly create a PKCS12 keystore with .p12 extension and
so on instead of a PKCS12 keystore masquerading as a JKS one. This
simplifies the SSL steps since the "conversion" step (which was doing
nothing) from .JKS -> .P12 can be removed.
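That backwards-compatibility behavior can be seen with a few lines of Java (the file name and password are placeholders): since Java 9 the JKS keystore implementation transparently opens PKCS12 files, which is why the mislabeled store still worked.

    import java.io.FileInputStream;
    import java.security.KeyStore;

    class KeystoreCompatCheck {
      public static void main(String[] args) throws Exception {
        KeyStore ks = KeyStore.getInstance("JKS");
        try (FileInputStream in = new FileInputStream("solr-ssl.keystore.p12")) {
          ks.load(in, "secret".toCharArray()); // succeeds even though the file is PKCS12
        }
      }
    }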
* SOLR-13984: add (experimental, disabled by default) security manager support.
Users can set SOLR_SECURITY_MANAGER_ENABLED=true to enable the security manager at runtime.
The current policy file used by tests is moved to solr/server.
Additional permissions are granted for the filesystem locations set by bin/solr, and networking everywhere is enabled.
This takes advantage of the fact that permission entries are ignored if properties are not defined:
https://docs.oracle.com/javase/7/docs/technotes/guides/security/PolicyFiles.html#PropertyExp
SOLR-14136: ip whitelist/blacklist via env vars
This makes it easy to restrict access to Solr by IP. For example SOLR_IP_WHITELIST="127.0.0.1, 192.168.0.0/24, [::1], [2000:123:4:5::]/64" would restrict access to v4/v6 localhost, the 192.168.0 ipv4 network, and 2000:123:4:5 ipv6 network. Any other IP will receive a 403 response.
Blacklisting functionality can deny access to problematic addresses or networks that would otherwise be allowed. For example SOLR_IP_BLACKLIST="192.168.0.3, 192.168.0.4" would explicitly prevent those two specific addresses from accessing solr.
User can now set SOLR_REQUESTLOG_ENABLED=true to enable the jetty request log, instead of editing XML. The location of the request logs will respect SOLR_LOGS_DIR if that is set. The deprecated NCSARequestLog is no longer used, instead it uses CustomRequestLog with NCSA_FORMAT.
The Overseer used java serialization to store command responses in ZooKeeper. This commit changes the code to use Javabin instead, while allowing Java serialization with a System property in case it's needed for compatibility.
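A minimal sketch of the Javabin round trip (JavaBinCodec is Solr's real API; the NamedList payload is illustrative):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.solr.common.util.JavaBinCodec;
    import org.apache.solr.common.util.NamedList;

    class OverseerResponseCodec {
      static byte[] marshal(NamedList<Object> response) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        new JavaBinCodec().marshal(response, out); // instead of ObjectOutputStream
        return out.toByteArray();
      }

      static Object unmarshal(byte[] data) throws IOException {
        return new JavaBinCodec().unmarshal(new ByteArrayInputStream(data));
      }
    }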
Jetty 9.4.16.v20190411 and up introduced separate
client and server SslContextFactory implementations.
This split requires the proper use of
SslContextFactory in client and server configs (see the sketch below).
This fixes the following
* SSL with SOLR_SSL_NEED_CLIENT_AUTH not working since v8.2.0
* Http2SolrClient SSL not working in branch_8x
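A minimal sketch of the split (Jetty 9.4.16+ classes; the paths and settings are placeholders):

    import org.eclipse.jetty.util.ssl.SslContextFactory;

    class SslSetupSketch {
      // Server side: can require client certs (SOLR_SSL_NEED_CLIENT_AUTH).
      SslContextFactory.Server server = new SslContextFactory.Server();
      // Client side: a separate implementation as of 9.4.16.v20190411.
      SslContextFactory.Client client = new SslContextFactory.Client();
      {
        server.setKeyStorePath("/path/to/keystore.p12");
        server.setNeedClientAuth(true);
        client.setTrustStorePath("/path/to/truststore.p12");
      }
    }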
Signed-off-by: Kevin Risden <krisden@apache.org>
Thread.sleep() is "subject to the precision and accuracy of system timers and schedulers."
But tests using DelayingSearchComponent need to ensure that it sleeps *at least* as long as they request, in order to trigger the timeAllowed constraint
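A sketch of the guaranteed-minimum sleep this calls for (the helper name is illustrative): re-check a nanoTime deadline and sleep again if woken early.

    class Sleeps {
      static void sleepAtLeast(long millis) throws InterruptedException {
        final long deadline = System.nanoTime() + millis * 1_000_000L;
        long remainingNanos;
        while ((remainingNanos = deadline - System.nanoTime()) > 0) {
          // Round up so a sub-millisecond remainder doesn't become sleep(0).
          Thread.sleep(remainingNanos / 1_000_000L + 1);
        }
      }
    }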
The helper method RoutedAliasUpdateProcessorTest.addDocsAndCommit doesn't guarantee docs have been committed when it returns, causing threading/timing bugs in tests that use it as a gate for making subsequent assertions -- and a steady stream of jenkins test failures.
Introduced in SOLR-14033 by including
commons-compress as a compile time
dependency in Solr core instead of as
a test-only dependency.
Signed-off-by: Kevin Risden <krisden@apache.org>
* Better logging and error reporting
* Fixing deploy command to handle previously undeployed packages
* Test now uses @LogLevel annotation
* Deploy command had a hard-coded collection name by mistake; fixed it
* split "resource-and-plugin-loading.adoc" into "resource-loading.adoc" and "libs.adoc" then overhauled both.
* enhanced "config-sets.adoc", moving some content in from elsewhere; bit of an overhaul.
* solr-plugins.adoc is now top-level; overhauled content
* Move resource-loading.adoc up a level in the TOC to underneath "The Well-Configured Solr Instance".
* Separate out the leading sentence.
The default configset no longer has the following:
- Library inclusions (<lib ../>) for extraction, solr-cell libs, clustering, velocity and language identifier
- /browse, /tvrh and /update/extract handlers
- TermVector component (if someone wants it, it can be added using the config APIs)
- XSLT response writer
- Velocity response writer
If you want to use them in your collections, please add them to your configset manually or through the Config APIs.
* Using collapse with grouping would cause inconsistent behavior.
This is because grouping calls the same postfilter twice without
resetting the internal state of the DocValues cache.
* Using expand with grouping would cause an NPE
Solr tests now have a similar policy to Lucene, loopback use only. If a
test tries to resolve or connect to the internet, it will get SecurityException.
Some solr tests explicitly try to talk to dead nodes with real
networking. This is not good and is asking for trouble, but they now use
low loopback port numbers instead of multicast addresses. The idea is
that it fails faster. These addresses are moved to constants so that
stuff isn't copy-pasted everywhere, in case we have to do something
different later.
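Illustrative constants (the actual names and port numbers may differ): low ports on loopback are essentially never bound, so connection attempts fail immediately instead of timing out.

    class DeadHosts {
      // Ports in the reserved low range are practically guaranteed unbound.
      public static final String DEAD_HOST_1 = "127.0.0.1:4";
      public static final String DEAD_HOST_2 = "127.0.0.1:6";
    }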
This removes the Solr security manager hacks
for Hadoop. It does so by:
* Using a fake group mapping class instead of ShellGroupMapping (see the sketch below)
* Copying a few Hadoop classes and modifying them for tests with no Shell
* Nulling out some of the static variables in the tests
The Hadoop files were taken from Apache Hadoop 3.2.0
and copied to the test package to be only picked up
during tests. They were modified to remove the need to
shell out for access. The assumption is that these
HDFS integration tests only run on Unix based systems
and therefore Windows compatibility was removed in some
of the modified classes. The long term goal is to remove
these custom Hadoop classes. All the copied classes are
in the org.apache.hadoop package.
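A minimal sketch of such a fake group mapping (GroupMappingServiceProvider is Hadoop's real interface; the class name and returned group are illustrative):

    import java.util.Collections;
    import java.util.List;
    import org.apache.hadoop.security.GroupMappingServiceProvider;

    public class FakeGroupMapping implements GroupMappingServiceProvider {
      @Override
      public List<String> getGroups(String user) {
        return Collections.singletonList("supergroup"); // no shell involved
      }

      @Override
      public void cacheGroupsRefresh() {}

      @Override
      public void cacheGroupsAdd(List<String> groups) {}
    }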
Signed-off-by: Kevin Risden <krisden@apache.org>
There are applications of ExpandComponent that intentionally do not
involve prior collapsing of results on the expand field, which can lead
to an NPE in ExpandComponent when the expand.field has fewer unique
values (for matched docs) than the number of matched docs.
This commit refines the approach taken in SOLR-13877, which addressed
the same underlying issue.
* rename stdDev, variance methods to reflect the functionality
* add util functions to compute corrected stdDev and variance (sketched below)
* use DocValuesIterator#advanceExact to check if a value exists for the doc
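A sketch of the corrected (sample) statistics involved (method names illustrative): divide the squared deviations by n-1 rather than n.

    class CorrectedStats {
      // Assumes values.length >= 2 (Bessel's correction divides by n-1).
      static double correctedVariance(double[] values) {
        double mean = 0;
        for (double v : values) mean += v;
        mean /= values.length;
        double sumSq = 0;
        for (double v : values) sumSq += (v - mean) * (v - mean);
        return sumSq / (values.length - 1);
      }

      static double correctedStdDev(double[] values) {
        return Math.sqrt(correctedVariance(values));
      }
    }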
whoami displays a warning if the effective uid is not in /etc/passwd.
This can happen in certain situations when running in a docker
container. This replaces the 'whoami' usage with a safer check.
The solr permissions are weak sauce due to the huge number of features, third-party dependencies, etc.
Hence they have access to do many things. For "scripting" such as velocity we have to look at a more aggressive stance:
Step 1: Can we wrap a sandbox around the whole goddamn thing and call it a day?
Step 2: Let's separate the "engine" from "untrusted code" and only be an asshole to the latter.
Step 3: Java's security is shit; let's contain that classloader and whitelist access.
Restrict this to only minimal paths like lucene. It is the defense
against directory traversal attacks. It will also help find bad bugs
where things are reading the filesystem in the wrong locations.
On Windows, these objects can't be inspected due to security restrictions, so the test runner fails the tests since it does not know how big the leak is.