Merge branch 'master' into feature/autoscaling

# Conflicts:
#	solr/CHANGES.txt
#	solr/core/src/java/org/apache/solr/cloud/Assign.java
#	solr/core/src/java/org/apache/solr/cloud/CreateCollectionCmd.java
#	solr/core/src/test/org/apache/solr/cloud/autoscaling/AutoScalingHandlerTest.java
Shalin Shekhar Mangar 2017-08-19 17:26:58 +05:30
commit 2a6acd3a87
86 changed files with 3903 additions and 648 deletions

View File

@ -222,7 +222,7 @@ org.codehaus.janino.version = 2.7.6
/org.codehaus.woodstox/stax2-api = 3.1.4
/org.codehaus.woodstox/woodstox-core-asl = 4.4.1
org.eclipse.jetty.version = 9.3.14.v20161028
org.eclipse.jetty.version = 9.3.20.v20170531
/org.eclipse.jetty/jetty-continuation = ${org.eclipse.jetty.version}
/org.eclipse.jetty/jetty-deploy = ${org.eclipse.jetty.version}
/org.eclipse.jetty/jetty-http = ${org.eclipse.jetty.version}

View File

@ -1 +0,0 @@
4ba272cee2e367766dfdc1901c960de352160d41

View File

@ -0,0 +1 @@
0176f1ef8366257e7b6214c3bbd710cf47593135

View File

@ -1 +0,0 @@
ea3800883f79f757b2635a737bb71bb21e90cf19

View File

@ -0,0 +1 @@
32f5fe22ed468a49df1ffcbb27c39c1b53f261aa

View File

@ -1 +0,0 @@
52d796b58c3a997e59e6b47c4bf022cedcba3514

View File

@ -0,0 +1 @@
5b68e7761fcacefcf26ad9ab50943db65fda2c3d

View File

@ -1 +0,0 @@
791df6c55ad62841ff518ba6634e905a95567260

View File

@ -0,0 +1 @@
6a1523d44ebb527eed068a5c8bfd22edd6a20530

View File

@ -1 +0,0 @@
b5714a6005387b2a361d5b39a3a37d4df1892e62

View File

@ -0,0 +1 @@
21a698f9d58d03cdf58bf2a40f93de58c2eab138

View File

@ -1 +0,0 @@
fbf89f6f3b995992f82ec09104ab9a75d31d281b

View File

@ -0,0 +1 @@
19ce4203809da37f8ea7a5632704fa71b6f0ccc2

View File

@ -27,7 +27,7 @@ Carrot2 3.15.0
Velocity 1.7 and Velocity Tools 2.0
Apache UIMA 2.3.1
Apache ZooKeeper 3.4.10
Jetty 9.3.14.v20161028
Jetty 9.3.20.v20170531
(No Changes)
@ -43,7 +43,7 @@ Carrot2 3.15.0
Velocity 1.7 and Velocity Tools 2.0
Apache UIMA 2.3.1
Apache ZooKeeper 3.4.10
Jetty 9.3.14.v20161028
Jetty 9.3.20.v20170531
Detailed Change List
----------------------
@ -149,9 +149,6 @@ Other Changes
* SOLR-11106: TestLBHttpSolrClient.testReliablity takes 30 seconds because of the wrong server name
(Kensho Hirasawa via Erick Erickson)
* SOLR-11122: Creating a core should write a core.properties file first and clean up on failure
(Erick Erickson)
* SOLR-10338: Configure SecureRandom non blocking for tests. (Mihaly Toth, hossman, Ishan Chattopadhyaya, via Mark Miller)
@ -181,6 +178,8 @@ Other Changes
* SOLR-10822: Share a Policy.session object between multiple collection admin calls to get the correct computations (noble)
* SOLR-11249: Upgrade Jetty from 9.3.14.v20161028 to 9.3.20.v20170531 (Michael Braun via David Smiley)
================== 7.0.0 ==================
Versions of Major Components
@ -321,6 +320,9 @@ Upgrading from Solr 6.x
* SOLR-11023: EnumField has been deprecated in favor of new EnumFieldType.
* SOLR-11239: The use of maxShardsPerNode is not supported when a cluster policy is in effect or
when a collection specific policy is specified during collection creation.
New Features
----------------------
* SOLR-9857, SOLR-9858: Collect aggregated metrics from nodes and shard leaders in overseer. (ab)
@ -508,6 +510,15 @@ Bug Fixes
* SOLR-10353: TestSQLHandler reproducible failure: No match found for function signature min(<NUMERIC>) (Kevin Risden)
* SOLR-11221: SolrJmxReporter broken on core reload. This resulted in some or most metrics not being reported
via JMX after core reloads, depending on timing. (ab)
* SOLR-11235: Some SolrCore metrics should check if core is closed before reporting. (ab)
* SOLR-10698: StreamHandler should allow connections to be closed early (Joel Bernstein, Varun Thacker, Erick Erickson)
* SOLR-11243: Replica Placement rules are ignored if a cluster policy exists. (shalin)
Optimizations
----------------------
@ -713,6 +724,12 @@ Other Changes
* SOLR-10821: Ref guide documentation for Autoscaling (Noble Paul, Cassandra Targett, shalin)
* SOLR-11239: A special value of -1 can be specified for 'maxShardsPerNode' to denote that
there is no limit. The bin/solr script sends maxShardsPerNode=-1 when creating collections. The
use of maxShardsPerNode is not supported when a cluster policy is in effect or when a
collection specific policy is specified during collection creation.
(Noble Paul, shalin)
================== 6.7.0 ==================
Consult the LUCENE_CHANGES.txt file for additional, low level, changes in this release.
@ -748,9 +765,6 @@ New Features
* SOLR-10307: Allow Passing SSL passwords through environment variables. (Mano Kovacs, Michael Suzuki via Mark Miller)
* SOLR-10721: Provide a way to know when Core Discovery is finished and when all async cores are done loading
(Erick Erickson)
* SOLR-10379: Add ManagedSynonymGraphFilterFactory, deprecate ManagedSynonymFilterFactory. (Steve Rowe)
* SOLR-10479: Adds support for HttpShardHandlerFactory.loadBalancerRequests(MinimumAbsolute|MaximumFraction)
@ -840,20 +854,10 @@ when using one of Exact*StatsCache (Mikhail Khludnev)
* SOLR-10963: Fix example json in MultipleAdditiveTreesModel javadocs.
(Stefan Langenmaier via Christine Poerschke)
* SOLR-10910: Clean up a few details left over from pluggable transient core and untangling
CoreDescriptor/CoreContainer references (Erick Erickson)
* SOLR-10914: RecoveryStrategy's sendPrepRecoveryCmd can get stuck for 5 minutes if leader is unloaded. (shalin)
* SOLR-11024: ParallelStream should set the StreamContext when constructing SolrStreams (Joel Bernstein)
* SOLR-10908: CloudSolrStream.toExpression incorrectly handles fq clauses (Rohit Singh via Erick Erickson)
* SOLR-11198: downconfig downloads empty file as folder (Erick Erickson)
* SOLR-11177: CoreContainer.load needs to send lazily loaded core descriptors to the proper list rather than send
them all to the transient lists. (Erick Erickson) (note, not in 7.0, is in 7.1)
Optimizations
----------------------
* SOLR-10634: JSON Facet API: When a field/terms facet will retrieve all buckets (i.e. limit:-1)
@ -921,6 +925,24 @@ Bug Fixes
* SOLR-10857: standalone Solr loads UNLOADed core on request (Erick Erickson, Mikhail Khludnev)
* SOLR-11024: ParallelStream should set the StreamContext when constructing SolrStreams (Joel Bernstein)
* SOLR-10908: CloudSolrStream.toExpression incorrectly handles fq clauses (Rohit Singh via Erick Erickson)
* SOLR-11177: CoreContainer.load needs to send lazily loaded core descriptors to the proper list rather than send
them all to the transient lists. (Erick Erickson) (note, not in 7.0, is in 7.1)
* SOLR-11122: Creating a core should write a core.properties file first and clean up on failure
(Erick Erickson)
* SOLR-10910: Clean up a few details left over from pluggable transient core and untangling
CoreDescriptor/CoreContainer references (Erick Erickson)
* SOLR-10721: Provide a way to know when Core Discovery is finished and when all async cores are done loading
(Erick Erickson)
* SOLR-11069: CDCR bootstrapping can get into an infinite loop when a core is reloaded (Amrit Sarkar, Erick Erickson)
================== 6.6.0 ==================
Consult the LUCENE_CHANGES.txt file for additional, low level, changes in this release.
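
Note: the SOLR-11239 entries above change how an unlimited per-node shard count is expressed. A hedged SolrJ sketch of creating a collection with no per-node limit (collection and config names are made up; SolrJ imports assumed):

    // Sketch only: "mycoll" / "myconf" are hypothetical names.
    CollectionAdminRequest.Create create =
        CollectionAdminRequest.Create.createCollection("mycoll", "myconf", 2, 2);
    create.setMaxShardsPerNode(-1);   // -1 denotes "no limit" (SOLR-11239)
    create.process(solrClient);
    // Explicitly passing maxShardsPerNode > 0 is rejected when a cluster policy
    // or collection-level policy is in effect.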

View File

@ -124,7 +124,7 @@ public class TestLTROnSolrCloud extends TestRerankBase {
// Test feature vectors returned (without re-ranking)
query.setFields("*,score,features:[fv]");
query.setFields("*,score,features:[fv store=test]");
queryResponse =
solrCluster.getSolrClient().query(COLLECTION,query);
assertEquals(8, queryResponse.getResults().getNumFound());
@ -146,7 +146,25 @@ public class TestLTROnSolrCloud extends TestRerankBase {
assertEquals(original_result6_score, queryResponse.getResults().get(6).get("score"));
assertEquals(original_result7_score, queryResponse.getResults().get(7).get("score"));
assertEquals(result7_features,
queryResponse.getResults().get(0).get("features").toString());
assertEquals(result6_features,
queryResponse.getResults().get(1).get("features").toString());
assertEquals(result5_features,
queryResponse.getResults().get(2).get("features").toString());
assertEquals(result4_features,
queryResponse.getResults().get(3).get("features").toString());
assertEquals(result3_features,
queryResponse.getResults().get(4).get("features").toString());
assertEquals(result2_features,
queryResponse.getResults().get(5).get("features").toString());
assertEquals(result1_features,
queryResponse.getResults().get(6).get("features").toString());
assertEquals(result0_features,
queryResponse.getResults().get(7).get("features").toString());
// Test feature vectors returned (with re-ranking)
query.setFields("*,score,features:[fv]");
query.add("rq", "{!ltr model=powpularityS-model reRankDocs=8}");
queryResponse =
solrCluster.getSolrClient().query(COLLECTION,query);

View File

@ -270,21 +270,7 @@ public class Assign {
}
}
if (policyName != null || !autoScalingConfig.getPolicy().getClusterPolicy().isEmpty()) {
if (message.getStr(CREATE_NODE_SET) == null)
nodeList = Collections.emptyList();// unless explicitly specified do not pass node list to Policy
synchronized (ocmh) {
PolicyHelper.SESSION_REF.set(ocmh.policySessionRef);
try {
return getPositionsUsingPolicy(collectionName,
shardNames, numNrtReplicas, numTlogReplicas, numPullReplicas, policyName, ocmh.zkStateReader, nodeList);
} finally {
PolicyHelper.SESSION_REF.remove();
}
}
} else {
log.debug("Identify nodes using rules framework");
if (rulesMap != null && !rulesMap.isEmpty()) {
List<Rule> rules = new ArrayList<>();
for (Object map : rulesMap) rules.add(new Rule((Map) map));
Map<String, Integer> sharVsReplicaCount = new HashMap<>();
@ -302,6 +288,19 @@ public class Assign {
return nodeMappings.entrySet().stream()
.map(e -> new ReplicaPosition(e.getKey().shard, e.getKey().index, e.getKey().type, e.getValue()))
.collect(Collectors.toList());
} else {
if (message.getStr(CREATE_NODE_SET) == null)
nodeList = Collections.emptyList();// unless explicitly specified do not pass node list to Policy
synchronized (ocmh) {
PolicyHelper.SESSION_REF.set(ocmh.policySessionRef);
try {
return getPositionsUsingPolicy(collectionName,
shardNames, numNrtReplicas, numTlogReplicas, numPullReplicas, policyName, ocmh.zkStateReader, nodeList);
} finally {
PolicyHelper.SESSION_REF.remove();
}
}
}
}
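
Note: the restructuring above is the SOLR-11243 fix — legacy replica-placement rules now take precedence, and the autoscaling policy session is only consulted when no rules were supplied. A minimal sketch of the resulting decision order (method and helper names are simplified and hypothetical, not Solr's exact signatures):

    // Sketch of the reordered branch in Assign (simplified, hypothetical helpers).
    List<ReplicaPosition> identifyNodes() {
      if (rulesMap != null && !rulesMap.isEmpty()) {
        // Replica-placement rules win even when a cluster policy exists.
        return positionsFromRules(rulesMap);
      }
      if (policyName != null || !autoScalingConfig.getPolicy().getClusterPolicy().isEmpty()) {
        // Only fall back to the policy framework when no rules are given;
        // the Policy.Session is shared via ocmh.policySessionRef.
        return positionsFromPolicy(policyName);
      }
      return defaultRoundRobinAssignment();
    }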

View File

@ -128,9 +128,11 @@ public class CreateCollectionCmd implements Cmd {
ClusterStateMutator.getShardNames(numSlices, shardNames);
}
int maxShardsPerNode = message.getInt(MAX_SHARDS_PER_NODE, usePolicyFramework? 0: 1);
if(maxShardsPerNode == 0) message.getProperties().put(MAX_SHARDS_PER_NODE, "0");
int maxShardsPerNode = message.getInt(MAX_SHARDS_PER_NODE, 1);
if (usePolicyFramework && message.getStr(MAX_SHARDS_PER_NODE) != null && maxShardsPerNode > 0) {
throw new SolrException(ErrorCode.BAD_REQUEST, "'maxShardsPerNode>0' is not supported when autoScaling policies are used");
}
if (maxShardsPerNode == -1 || usePolicyFramework) maxShardsPerNode = Integer.MAX_VALUE;
if (numNrtReplicas + numTlogReplicas <= 0) {
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, NRT_REPLICAS + " + " + TLOG_REPLICAS + " must be greater than 0");
}
@ -161,9 +163,11 @@ public class CreateCollectionCmd implements Cmd {
+ "). It's unusual to run two replica of the same slice on the same Solr-instance.");
}
int maxShardsAllowedToCreate = maxShardsPerNode * nodeList.size();
int maxShardsAllowedToCreate = maxShardsPerNode == Integer.MAX_VALUE ?
Integer.MAX_VALUE :
maxShardsPerNode * nodeList.size();
int requestedShardsToCreate = numSlices * totalNumReplicas;
if (!usePolicyFramework && maxShardsAllowedToCreate < requestedShardsToCreate) {
if (maxShardsAllowedToCreate < requestedShardsToCreate) {
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "Cannot create collection " + collectionName + ". Value of "
+ MAX_SHARDS_PER_NODE + " is " + maxShardsPerNode
+ ", and the number of nodes currently live or live and part of your "+CREATE_NODE_SET+" is " + nodeList.size()
@ -496,4 +500,15 @@ public class CreateCollectionCmd implements Cmd {
"Could not find configName for collection " + collection + " found:" + configNames);
}
}
public static boolean usePolicyFramework(ZkStateReader zkStateReader, ZkNodeProps message) {
Map autoScalingJson = Collections.emptyMap();
try {
autoScalingJson = Utils.getJson(zkStateReader.getZkClient(), SOLR_AUTOSCALING_CONF_PATH, true);
} catch (Exception e) {
return false;
}
return autoScalingJson.get(Policy.CLUSTER_POLICY) != null || message.getStr(Policy.POLICY) != null;
}
}
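
Note: taken together, the changes above make the policy check and the maxShardsPerNode validation explicit — usePolicyFramework() reads /autoscaling.json and looks for a cluster policy or a collection-level policy, an explicitly supplied maxShardsPerNode > 0 is rejected when a policy applies, and -1 (or any policy) lifts the per-node limit entirely. A condensed sketch of that flow, following the names in the diff:

    // Condensed sketch of the validation added above.
    boolean usePolicyFramework = usePolicyFramework(zkStateReader, message);
    int maxShardsPerNode = message.getInt(MAX_SHARDS_PER_NODE, 1);
    if (usePolicyFramework && message.getStr(MAX_SHARDS_PER_NODE) != null && maxShardsPerNode > 0) {
      throw new SolrException(ErrorCode.BAD_REQUEST,
          "'maxShardsPerNode>0' is not supported when autoScaling policies are used");
    }
    if (maxShardsPerNode == -1 || usePolicyFramework) {
      maxShardsPerNode = Integer.MAX_VALUE;   // "no limit": skip the shards-per-node capacity check
    }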

View File

@ -1134,9 +1134,9 @@ public final class SolrCore implements SolrInfoBean, SolrMetricProducer, Closeab
manager.registerGauge(this, registry, () -> startTime, true, "startTime", Category.CORE.toString());
manager.registerGauge(this, registry, () -> getOpenCount(), true, "refCount", Category.CORE.toString());
manager.registerGauge(this, registry, () -> resourceLoader.getInstancePath().toString(), true, "instanceDir", Category.CORE.toString());
manager.registerGauge(this, registry, () -> getIndexDir(), true, "indexDir", Category.CORE.toString());
manager.registerGauge(this, registry, () -> getIndexSize(), true, "sizeInBytes", Category.INDEX.toString());
manager.registerGauge(this, registry, () -> NumberUtils.readableSize(getIndexSize()), true, "size", Category.INDEX.toString());
manager.registerGauge(this, registry, () -> isClosed() ? "(closed)" : getIndexDir(), true, "indexDir", Category.CORE.toString());
manager.registerGauge(this, registry, () -> isClosed() ? 0 : getIndexSize(), true, "sizeInBytes", Category.INDEX.toString());
manager.registerGauge(this, registry, () -> isClosed() ? "(closed)" : NumberUtils.readableSize(getIndexSize()), true, "size", Category.INDEX.toString());
if (coreContainer != null) {
manager.registerGauge(this, registry, () -> coreContainer.getNamesForCore(this), true, "aliases", Category.CORE.toString());
final CloudDescriptor cd = getCoreDescriptor().getCloudDescriptor();
@ -1489,6 +1489,16 @@ public final class SolrCore implements SolrInfoBean, SolrMetricProducer, Closeab
}
log.info(logid+" CLOSING SolrCore " + this);
// stop reporting metrics
try {
coreMetricManager.close();
} catch (Throwable e) {
SolrException.log(log, e);
if (e instanceof Error) {
throw (Error) e;
}
}
if( closeHooks != null ) {
for( CloseHook hook : closeHooks ) {
try {
@ -1587,15 +1597,6 @@ public final class SolrCore implements SolrInfoBean, SolrMetricProducer, Closeab
}
}
try {
coreMetricManager.close();
} catch (Throwable e) {
SolrException.log(log, e);
if (e instanceof Error) {
throw (Error) e;
}
}
// Close the snapshots meta-data directory.
Directory snapshotsDir = snapshotMgr.getSnapshotsDir();
try {
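
Note: two ideas drive the SolrCore changes — index-level gauges now short-circuit with isClosed() so a JMX poll on a closing core returns a placeholder instead of touching a released directory, and coreMetricManager.close() moves to the top of close() so reporting stops before the rest of the teardown runs. An abbreviated sketch of the revised ordering (SOLR-11221 / SOLR-11235):

    // Abbreviated sketch of the new close() ordering.
    public void close() {
      // stop reporting metrics first, so JMX reporters are unregistered before teardown
      try {
        coreMetricManager.close();
      } catch (Throwable e) {
        SolrException.log(log, e);
        if (e instanceof Error) throw (Error) e;
      }
      // ... then run close hooks, close searcher/update handler,
      // and finally close the snapshots metadata directory ...
    }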

View File

@ -30,7 +30,6 @@ import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.Lock;
import org.apache.solr.client.solrj.SolrRequest;
@ -61,6 +60,7 @@ import org.apache.solr.request.SolrRequestHandler;
import org.apache.solr.request.SolrRequestInfo;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.CdcrUpdateLog;
import org.apache.solr.update.SolrCoreState;
import org.apache.solr.update.UpdateLog;
import org.apache.solr.update.VersionInfo;
import org.apache.solr.update.processor.DistributedUpdateProcessor;
@ -617,10 +617,6 @@ public class CdcrRequestHandler extends RequestHandlerBase implements SolrCoreAw
rsp.add(CdcrParams.ERRORS, hosts);
}
private AtomicBoolean running = new AtomicBoolean();
private volatile Future<Boolean> bootstrapFuture;
private volatile BootstrapCallable bootstrapCallable;
private void handleBootstrapAction(SolrQueryRequest req, SolrQueryResponse rsp) throws IOException, SolrServerException {
String collectionName = core.getCoreDescriptor().getCloudDescriptor().getCollectionName();
String shard = core.getCoreDescriptor().getCloudDescriptor().getShardId();
@ -633,14 +629,19 @@ public class CdcrRequestHandler extends RequestHandlerBase implements SolrCoreAw
Runnable runnable = () -> {
Lock recoveryLock = req.getCore().getSolrCoreState().getRecoveryLock();
boolean locked = recoveryLock.tryLock();
SolrCoreState coreState = core.getSolrCoreState();
try {
if (!locked) {
handleCancelBootstrap(req, rsp);
} else if (leaderStateManager.amILeader()) {
running.set(true);
coreState.setCdcrBootstrapRunning(true);
//running.set(true);
String masterUrl = req.getParams().get(ReplicationHandler.MASTER_URL);
bootstrapCallable = new BootstrapCallable(masterUrl, core);
bootstrapFuture = core.getCoreContainer().getUpdateShardHandler().getRecoveryExecutor().submit(bootstrapCallable);
BootstrapCallable bootstrapCallable = new BootstrapCallable(masterUrl, core);
coreState.setCdcrBootstrapCallable(bootstrapCallable);
Future<Boolean> bootstrapFuture = core.getCoreContainer().getUpdateShardHandler().getRecoveryExecutor()
.submit(bootstrapCallable);
coreState.setCdcrBootstrapFuture(bootstrapFuture);
try {
bootstrapFuture.get();
} catch (InterruptedException e) {
@ -654,7 +655,7 @@ public class CdcrRequestHandler extends RequestHandlerBase implements SolrCoreAw
}
} finally {
if (locked) {
running.set(false);
coreState.setCdcrBootstrapRunning(false);
recoveryLock.unlock();
}
}
@ -670,19 +671,20 @@ public class CdcrRequestHandler extends RequestHandlerBase implements SolrCoreAw
}
private void handleCancelBootstrap(SolrQueryRequest req, SolrQueryResponse rsp) {
BootstrapCallable callable = this.bootstrapCallable;
BootstrapCallable callable = (BootstrapCallable)core.getSolrCoreState().getCdcrBootstrapCallable();
IOUtils.closeQuietly(callable);
rsp.add(RESPONSE_STATUS, "cancelled");
}
private void handleBootstrapStatus(SolrQueryRequest req, SolrQueryResponse rsp) throws IOException, SolrServerException {
if (running.get()) {
SolrCoreState coreState = core.getSolrCoreState();
if (coreState.getCdcrBootstrapRunning()) {
rsp.add(RESPONSE_STATUS, RUNNING);
return;
}
Future<Boolean> future = bootstrapFuture;
BootstrapCallable callable = this.bootstrapCallable;
Future<Boolean> future = coreState.getCdcrBootstrapFuture();
BootstrapCallable callable = (BootstrapCallable)coreState.getCdcrBootstrapCallable();
if (future == null) {
rsp.add(RESPONSE_STATUS, "notfound");
rsp.add(RESPONSE_MESSAGE, "No bootstrap found in running, completed or failed states");
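
Note: moving the running flag, future, and callable from handler fields into SolrCoreState is the heart of the SOLR-11069 fix — the request handler is recreated on core reload, but the SolrCoreState instance survives, so a reload no longer loses (or restarts) an in-flight bootstrap. A short sketch of reading that shared state, mirroring the code above:

    // Sketch: bootstrap status is read from SolrCoreState, which outlives core reloads.
    SolrCoreState coreState = core.getSolrCoreState();
    if (coreState.getCdcrBootstrapRunning()) {
      rsp.add(RESPONSE_STATUS, RUNNING);
    } else if (coreState.getCdcrBootstrapFuture() == null) {
      rsp.add(RESPONSE_STATUS, "notfound");
      rsp.add(RESPONSE_MESSAGE, "No bootstrap found in running, completed or failed states");
    }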

View File

@ -59,6 +59,7 @@ public class MetricsMap implements Gauge<Map<String,Object>>, DynamicMBean {
private final boolean useCachedStatsBetweenGetMBeanInfoCalls = Boolean.getBoolean("useCachedStatsBetweenGetMBeanInfoCalls");
private BiConsumer<Boolean, Map<String, Object>> initializer;
private Map<String, String> jmxAttributes = new HashMap<>();
private volatile Map<String,Object> cachedValue;
public MetricsMap(BiConsumer<Boolean, Map<String,Object>> initializer) {
@ -83,6 +84,11 @@ public class MetricsMap implements Gauge<Map<String,Object>>, DynamicMBean {
@Override
public Object getAttribute(String attribute) throws AttributeNotFoundException, MBeanException, ReflectionException {
Object val;
// jmxAttributes override any real values
val = jmxAttributes.get(attribute);
if (val != null) {
return val;
}
Map<String,Object> stats = null;
if (useCachedStatsBetweenGetMBeanInfoCalls) {
Map<String,Object> cachedStats = this.cachedValue;
@ -111,7 +117,7 @@ public class MetricsMap implements Gauge<Map<String,Object>>, DynamicMBean {
@Override
public void setAttribute(Attribute attribute) throws AttributeNotFoundException, InvalidAttributeValueException, MBeanException, ReflectionException {
throw new UnsupportedOperationException("Operation not Supported");
jmxAttributes.put(attribute.getName(), String.valueOf(attribute.getValue()));
}
@Override
@ -144,8 +150,15 @@ public class MetricsMap implements Gauge<Map<String,Object>>, DynamicMBean {
if (useCachedStatsBetweenGetMBeanInfoCalls) {
cachedValue = stats;
}
jmxAttributes.forEach((k, v) -> {
attrInfoList.add(new MBeanAttributeInfo(k, String.class.getName(),
null, true, false, false));
});
try {
stats.forEach((k, v) -> {
if (jmxAttributes.containsKey(k)) {
return;
}
Class type = v.getClass();
OpenType typeBox = determineType(type);
if (type.equals(String.class) || typeBox == null) {
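
Note: the new jmxAttributes map lets callers attach synthetic JMX attributes to a MetricsMap — setAttribute() stores them, getAttribute() serves them ahead of the real metric values, and getMBeanInfo() advertises them. JmxMetricsReporter relies on this to stamp each MetricsMap with its "_instanceTag". A small usage sketch (the tag value is hypothetical):

    // Usage sketch; "abcd1234" is a made-up tag value.
    void tagMetricsMap() throws Exception {
      MetricsMap mm = new MetricsMap((detailed, map) -> map.put("requests", 42L));
      mm.setAttribute(new javax.management.Attribute("_instanceTag", "abcd1234"));
      Object tag = mm.getAttribute("_instanceTag");   // synthetic attribute overrides real values
      assert "abcd1234".equals(tag);
    }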

View File

@ -16,26 +16,20 @@
*/
package org.apache.solr.metrics.reporters;
import javax.management.InstanceNotFoundException;
import javax.management.MBeanServer;
import javax.management.ObjectInstance;
import javax.management.ObjectName;
import java.lang.invoke.MethodHandles;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import com.codahale.metrics.Gauge;
import com.codahale.metrics.JmxReporter;
import com.codahale.metrics.MetricFilter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.MetricRegistryListener;
import org.apache.solr.metrics.FilteringSolrMetricReporter;
import org.apache.solr.metrics.MetricsMap;
import org.apache.solr.metrics.SolrMetricManager;
import org.apache.solr.metrics.SolrMetricReporter;
import org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter;
import org.apache.solr.metrics.reporters.jmx.JmxObjectNameFactory;
import org.apache.solr.util.JmxUtil;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@ -57,10 +51,10 @@ public class SolrJmxReporter extends FilteringSolrMetricReporter {
private String serviceUrl;
private String rootName;
private JmxReporter reporter;
private MetricRegistry registry;
private MBeanServer mBeanServer;
private MetricsMapListener listener;
private JmxMetricsReporter reporter;
private boolean started;
/**
* Creates a new instance of {@link SolrJmxReporter}.
@ -102,51 +96,32 @@ public class SolrJmxReporter extends FilteringSolrMetricReporter {
}
JmxObjectNameFactory jmxObjectNameFactory = new JmxObjectNameFactory(pluginInfo.name, fullDomain);
registry = metricManager.registry(registryName);
final MetricFilter filter = newMetricFilter();
reporter = JmxReporter.forRegistry(registry)
final MetricFilter filter = newMetricFilter();
String tag = Integer.toHexString(this.hashCode());
reporter = JmxMetricsReporter.forRegistry(registry)
.registerWith(mBeanServer)
.inDomain(fullDomain)
.filter(filter)
.createsObjectNamesWith(jmxObjectNameFactory)
.withTag(tag)
.build();
reporter.start();
// workaround for inability to register custom MBeans (to be available in metrics 4.0?)
listener = new MetricsMapListener(mBeanServer, jmxObjectNameFactory);
registry.addListener(listener);
started = true;
log.info("JMX monitoring for '" + fullDomain + "' (registry '" + registryName + "') enabled at server: " + mBeanServer);
}
@Override
protected MetricFilter newMetricFilter() {
// filter out MetricsMap gauges - we have a better way of handling them
final MetricFilter mmFilter = (name, metric) -> !(metric instanceof MetricsMap);
final MetricFilter filter;
if (filters.isEmpty()) {
filter = mmFilter;
} else {
// apply also prefix filters
SolrMetricManager.PrefixFilter prefixFilter = new SolrMetricManager.PrefixFilter(filters);
filter = new SolrMetricManager.AndFilter(prefixFilter, mmFilter);
}
return filter;
}
/**
* Stops the reporter from publishing metrics.
*/
@Override
public synchronized void close() {
log.info("Closing reporter " + this + " for registry " + registryName + " / " + registry);
started = false;
if (reporter != null) {
reporter.close();
reporter = null;
}
if (listener != null && registry != null) {
registry.removeListener(listener);
listener.close();
listener = null;
}
}
/**
@ -238,70 +213,23 @@ public class SolrJmxReporter extends FilteringSolrMetricReporter {
/**
* For unit tests.
* @return true if this reporter is actively reporting metrics to JMX.
* @return true if this reporter is going to report metrics to JMX.
*/
public boolean isActive() {
return reporter != null;
}
/**
* For unit tests.
* @return true if this reporter has been started and is reporting metrics to JMX.
*/
public boolean isStarted() {
return started;
}
@Override
public String toString() {
return String.format(Locale.ENGLISH, "[%s@%s: rootName = %s, domain = %s, service url = %s, agent id = %s]",
getClass().getName(), Integer.toHexString(hashCode()), rootName, domain, serviceUrl, agentId);
}
private static class MetricsMapListener extends MetricRegistryListener.Base {
MBeanServer server;
JmxObjectNameFactory nameFactory;
// keep the names so that we can unregister them on core close
Set<ObjectName> registered = new HashSet<>();
MetricsMapListener(MBeanServer server, JmxObjectNameFactory nameFactory) {
this.server = server;
this.nameFactory = nameFactory;
}
@Override
public void onGaugeAdded(String name, Gauge<?> gauge) {
if (!(gauge instanceof MetricsMap)) {
return;
}
synchronized (server) {
try {
ObjectName objectName = nameFactory.createName("gauges", nameFactory.getDomain(), name);
log.debug("REGISTER " + objectName);
if (registered.contains(objectName) || server.isRegistered(objectName)) {
log.debug("-unregistering old instance of " + objectName);
try {
server.unregisterMBean(objectName);
} catch (InstanceNotFoundException e) {
// ignore
}
}
// some MBean servers re-write object name to include additional properties
ObjectInstance instance = server.registerMBean(gauge, objectName);
if (instance != null) {
registered.add(instance.getObjectName());
}
} catch (Exception e) {
log.warn("bean registration error", e);
}
}
}
public void close() {
synchronized (server) {
for (ObjectName name : registered) {
try {
if (server.isRegistered(name)) {
server.unregisterMBean(name);
}
} catch (Exception e) {
log.debug("bean unregistration error", e);
}
}
registered.clear();
}
}
}
}
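
Note: SolrJmxReporter now delegates to JmxMetricsReporter and tags every MBean it registers with its own hash code, so close() removes exactly the beans this reporter created — which matters when a core reload briefly leaves two reporters for the same registry. The builder chain, pulled out of the diff as a standalone sketch (registry, mBeanServer, and the object-name factory assumed to be set up as above):

    // Sketch of the tagged-reporter lifecycle.
    String tag = Integer.toHexString(this.hashCode());
    JmxMetricsReporter reporter = JmxMetricsReporter.forRegistry(registry)
        .registerWith(mBeanServer)
        .inDomain(fullDomain)
        .filter(MetricFilter.ALL)            // the real code passes newMetricFilter()
        .createsObjectNamesWith(jmxObjectNameFactory)
        .withTag(tag)
        .build();
    reporter.start();    // also registers metrics that already exist in the registry
    // ...
    reporter.close();    // unregisters only MBeans whose "_instanceTag" matches this tag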

View File

@ -0,0 +1,750 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.solr.metrics.reporters.jmx;
import javax.management.Attribute;
import javax.management.InstanceAlreadyExistsException;
import javax.management.InstanceNotFoundException;
import javax.management.JMException;
import javax.management.MBeanRegistrationException;
import javax.management.MBeanServer;
import javax.management.ObjectInstance;
import javax.management.ObjectName;
import javax.management.Query;
import javax.management.QueryExp;
import java.io.Closeable;
import java.lang.invoke.MethodHandles;
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import com.codahale.metrics.Counter;
import com.codahale.metrics.DefaultObjectNameFactory;
import com.codahale.metrics.Gauge;
import com.codahale.metrics.Histogram;
import com.codahale.metrics.Meter;
import com.codahale.metrics.Metered;
import com.codahale.metrics.Metric;
import com.codahale.metrics.MetricFilter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.MetricRegistryListener;
import com.codahale.metrics.ObjectNameFactory;
import com.codahale.metrics.Reporter;
import com.codahale.metrics.Timer;
import org.apache.solr.metrics.MetricsMap;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
* This is a modified copy of Dropwizard's {@link com.codahale.metrics.JmxReporter} and classes that it internally uses,
* with a few important differences:
* <ul>
* <li>this class knows that it can directly use {@link MetricsMap} as a dynamic MBean.</li>
* <li>this class allows us to "tag" MBean instances so that we can later unregister only instances registered with the
* same tag.</li>
* <li>this class processes all metrics already existing in the registry at the time when reporter is started.</li>
* </ul>
*/
public class JmxMetricsReporter implements Reporter, Closeable {
private static final Logger LOG = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
public static final String INSTANCE_TAG = "_instanceTag";
public static Builder forRegistry(MetricRegistry registry) {
return new Builder(registry);
}
public static class Builder {
private final MetricRegistry registry;
private MBeanServer mBeanServer;
private TimeUnit rateUnit;
private TimeUnit durationUnit;
private ObjectNameFactory objectNameFactory;
private MetricFilter filter = MetricFilter.ALL;
private String domain;
private String tag;
private Builder(MetricRegistry registry) {
this.registry = registry;
this.rateUnit = TimeUnit.SECONDS;
this.durationUnit = TimeUnit.MILLISECONDS;
this.domain = "metrics";
this.objectNameFactory = new DefaultObjectNameFactory();
}
/**
* Register MBeans with the given {@link MBeanServer}.
*
* @param mBeanServer an {@link MBeanServer}
* @return {@code this}
*/
public Builder registerWith(MBeanServer mBeanServer) {
this.mBeanServer = mBeanServer;
return this;
}
/**
* Convert rates to the given time unit.
*
* @param rateUnit a unit of time
* @return {@code this}
*/
public Builder convertRatesTo(TimeUnit rateUnit) {
this.rateUnit = rateUnit;
return this;
}
public Builder createsObjectNamesWith(ObjectNameFactory onFactory) {
if(onFactory == null) {
throw new IllegalArgumentException("null objectNameFactory");
}
this.objectNameFactory = onFactory;
return this;
}
/**
* Convert durations to the given time unit.
*
* @param durationUnit a unit of time
* @return {@code this}
*/
public Builder convertDurationsTo(TimeUnit durationUnit) {
this.durationUnit = durationUnit;
return this;
}
/**
* Only report metrics which match the given filter.
*
* @param filter a {@link MetricFilter}
* @return {@code this}
*/
public Builder filter(MetricFilter filter) {
this.filter = filter;
return this;
}
public Builder inDomain(String domain) {
this.domain = domain;
return this;
}
public Builder withTag(String tag) {
this.tag = tag;
return this;
}
public JmxMetricsReporter build() {
if (mBeanServer == null) {
mBeanServer = ManagementFactory.getPlatformMBeanServer();
}
if (tag == null) {
tag = Integer.toHexString(this.hashCode());
}
return new JmxMetricsReporter(mBeanServer, domain, registry, filter, rateUnit, durationUnit, objectNameFactory, tag);
}
}
// MBean interfaces and base classes
public interface MetricMBean {
ObjectName objectName();
// this strange-looking method name is used for producing "_instanceTag" attribute name
String get_instanceTag();
}
private abstract static class AbstractBean implements MetricMBean {
private final ObjectName objectName;
private final String instanceTag;
AbstractBean(ObjectName objectName, String instanceTag) {
this.objectName = objectName;
this.instanceTag = instanceTag;
}
@Override
public String get_instanceTag() {
return instanceTag;
}
@Override
public ObjectName objectName() {
return objectName;
}
}
public interface JmxGaugeMBean extends MetricMBean {
Object getValue();
}
private static class JmxGauge extends AbstractBean implements JmxGaugeMBean {
private final Gauge<?> metric;
private JmxGauge(Gauge<?> metric, ObjectName objectName, String tag) {
super(objectName, tag);
this.metric = metric;
}
@Override
public Object getValue() {
return metric.getValue();
}
}
public interface JmxCounterMBean extends MetricMBean {
long getCount();
}
private static class JmxCounter extends AbstractBean implements JmxCounterMBean {
private final Counter metric;
private JmxCounter(Counter metric, ObjectName objectName, String tag) {
super(objectName, tag);
this.metric = metric;
}
@Override
public long getCount() {
return metric.getCount();
}
}
public interface JmxHistogramMBean extends MetricMBean {
long getCount();
long getMin();
long getMax();
double getMean();
double getStdDev();
double get50thPercentile();
double get75thPercentile();
double get95thPercentile();
double get98thPercentile();
double get99thPercentile();
double get999thPercentile();
long[] values();
long getSnapshotSize();
}
private static class JmxHistogram extends AbstractBean implements JmxHistogramMBean {
private final Histogram metric;
private JmxHistogram(Histogram metric, ObjectName objectName, String tag) {
super(objectName, tag);
this.metric = metric;
}
@Override
public double get50thPercentile() {
return metric.getSnapshot().getMedian();
}
@Override
public long getCount() {
return metric.getCount();
}
@Override
public long getMin() {
return metric.getSnapshot().getMin();
}
@Override
public long getMax() {
return metric.getSnapshot().getMax();
}
@Override
public double getMean() {
return metric.getSnapshot().getMean();
}
@Override
public double getStdDev() {
return metric.getSnapshot().getStdDev();
}
@Override
public double get75thPercentile() {
return metric.getSnapshot().get75thPercentile();
}
@Override
public double get95thPercentile() {
return metric.getSnapshot().get95thPercentile();
}
@Override
public double get98thPercentile() {
return metric.getSnapshot().get98thPercentile();
}
@Override
public double get99thPercentile() {
return metric.getSnapshot().get99thPercentile();
}
@Override
public double get999thPercentile() {
return metric.getSnapshot().get999thPercentile();
}
@Override
public long[] values() {
return metric.getSnapshot().getValues();
}
public long getSnapshotSize() {
return metric.getSnapshot().size();
}
}
public interface JmxMeterMBean extends MetricMBean {
long getCount();
double getMeanRate();
double getOneMinuteRate();
double getFiveMinuteRate();
double getFifteenMinuteRate();
String getRateUnit();
}
private static class JmxMeter extends AbstractBean implements JmxMeterMBean {
private final Metered metric;
private final double rateFactor;
private final String rateUnit;
private JmxMeter(Metered metric, ObjectName objectName, TimeUnit rateUnit, String tag) {
super(objectName, tag);
this.metric = metric;
this.rateFactor = rateUnit.toSeconds(1);
this.rateUnit = ("events/" + calculateRateUnit(rateUnit)).intern();
}
@Override
public long getCount() {
return metric.getCount();
}
@Override
public double getMeanRate() {
return metric.getMeanRate() * rateFactor;
}
@Override
public double getOneMinuteRate() {
return metric.getOneMinuteRate() * rateFactor;
}
@Override
public double getFiveMinuteRate() {
return metric.getFiveMinuteRate() * rateFactor;
}
@Override
public double getFifteenMinuteRate() {
return metric.getFifteenMinuteRate() * rateFactor;
}
@Override
public String getRateUnit() {
return rateUnit;
}
private String calculateRateUnit(TimeUnit unit) {
final String s = unit.toString().toLowerCase(Locale.US);
return s.substring(0, s.length() - 1);
}
}
public interface JmxTimerMBean extends JmxMeterMBean {
double getMin();
double getMax();
double getMean();
double getStdDev();
double get50thPercentile();
double get75thPercentile();
double get95thPercentile();
double get98thPercentile();
double get99thPercentile();
double get999thPercentile();
long[] values();
String getDurationUnit();
}
private static class JmxTimer extends JmxMeter implements JmxTimerMBean {
private final Timer metric;
private final double durationFactor;
private final String durationUnit;
private JmxTimer(Timer metric,
ObjectName objectName,
TimeUnit rateUnit,
TimeUnit durationUnit, String tag) {
super(metric, objectName, rateUnit, tag);
this.metric = metric;
this.durationFactor = 1.0 / durationUnit.toNanos(1);
this.durationUnit = durationUnit.toString().toLowerCase(Locale.US);
}
@Override
public double get50thPercentile() {
return metric.getSnapshot().getMedian() * durationFactor;
}
@Override
public double getMin() {
return metric.getSnapshot().getMin() * durationFactor;
}
@Override
public double getMax() {
return metric.getSnapshot().getMax() * durationFactor;
}
@Override
public double getMean() {
return metric.getSnapshot().getMean() * durationFactor;
}
@Override
public double getStdDev() {
return metric.getSnapshot().getStdDev() * durationFactor;
}
@Override
public double get75thPercentile() {
return metric.getSnapshot().get75thPercentile() * durationFactor;
}
@Override
public double get95thPercentile() {
return metric.getSnapshot().get95thPercentile() * durationFactor;
}
@Override
public double get98thPercentile() {
return metric.getSnapshot().get98thPercentile() * durationFactor;
}
@Override
public double get99thPercentile() {
return metric.getSnapshot().get99thPercentile() * durationFactor;
}
@Override
public double get999thPercentile() {
return metric.getSnapshot().get999thPercentile() * durationFactor;
}
@Override
public long[] values() {
return metric.getSnapshot().getValues();
}
@Override
public String getDurationUnit() {
return durationUnit;
}
}
private static class JmxListener implements MetricRegistryListener {
private final String name;
private final MBeanServer mBeanServer;
private final MetricFilter filter;
private final TimeUnit rateUnit;
private final TimeUnit durationUnit;
private final Map<ObjectName, ObjectName> registered;
private final ObjectNameFactory objectNameFactory;
private final String tag;
private final QueryExp exp;
private JmxListener(MBeanServer mBeanServer, String name, MetricFilter filter, TimeUnit rateUnit, TimeUnit durationUnit,
ObjectNameFactory objectNameFactory, String tag) {
this.mBeanServer = mBeanServer;
this.name = name;
this.filter = filter;
this.rateUnit = rateUnit;
this.durationUnit = durationUnit;
this.registered = new ConcurrentHashMap<>();
this.objectNameFactory = objectNameFactory;
this.tag = tag;
this.exp = Query.eq(Query.attr(INSTANCE_TAG), Query.value(tag));
}
private void registerMBean(Object mBean, ObjectName objectName) throws InstanceAlreadyExistsException, JMException {
// remove previous bean if exists
if (mBeanServer.isRegistered(objectName)) {
if (LOG.isDebugEnabled()) {
Set<ObjectInstance> objects = mBeanServer.queryMBeans(objectName, null);
LOG.debug("## removing existing " + objects.size() + " bean(s) for " + objectName.getCanonicalName() + ", current tag=" + tag + ":");
for (ObjectInstance inst : objects) {
LOG.debug("## - tag=" + mBeanServer.getAttribute(inst.getObjectName(), INSTANCE_TAG));
}
}
mBeanServer.unregisterMBean(objectName);
}
ObjectInstance objectInstance = mBeanServer.registerMBean(mBean, objectName);
if (objectInstance != null) {
// the websphere mbeanserver rewrites the objectname to include
// cell, node & server info
// make sure we capture the new objectName for unregistration
registered.put(objectName, objectInstance.getObjectName());
} else {
registered.put(objectName, objectName);
}
LOG.debug("## registered " + objectInstance.getObjectName().getCanonicalName() + ", tag=" + tag);
}
private void unregisterMBean(ObjectName originalObjectName) throws InstanceNotFoundException, MBeanRegistrationException {
ObjectName objectName = registered.remove(originalObjectName);
if (objectName == null) {
objectName = originalObjectName;
}
Set<ObjectInstance> objects = mBeanServer.queryMBeans(objectName, exp);
for (ObjectInstance o : objects) {
LOG.debug("## Unregistered " + o.getObjectName().getCanonicalName() + ", tag=" + tag);
mBeanServer.unregisterMBean(o.getObjectName());
}
}
@Override
public void onGaugeAdded(String name, Gauge<?> gauge) {
try {
if (filter.matches(name, gauge)) {
final ObjectName objectName = createName("gauges", name);
if (gauge instanceof MetricsMap) {
((MetricsMap)gauge).setAttribute(new Attribute(INSTANCE_TAG, tag));
registerMBean(gauge, objectName);
} else {
registerMBean(new JmxGauge(gauge, objectName, tag), objectName);
}
}
} catch (InstanceAlreadyExistsException e) {
LOG.debug("Unable to register gauge", e);
} catch (JMException e) {
LOG.warn("Unable to register gauge", e);
}
}
@Override
public void onGaugeRemoved(String name) {
try {
final ObjectName objectName = createName("gauges", name);
unregisterMBean(objectName);
} catch (InstanceNotFoundException e) {
LOG.debug("Unable to unregister gauge", e);
} catch (MBeanRegistrationException e) {
LOG.warn("Unable to unregister gauge", e);
}
}
@Override
public void onCounterAdded(String name, Counter counter) {
try {
if (filter.matches(name, counter)) {
final ObjectName objectName = createName("counters", name);
registerMBean(new JmxCounter(counter, objectName, tag), objectName);
}
} catch (InstanceAlreadyExistsException e) {
LOG.debug("Unable to register counter", e);
} catch (JMException e) {
LOG.warn("Unable to register counter", e);
}
}
@Override
public void onCounterRemoved(String name) {
try {
final ObjectName objectName = createName("counters", name);
unregisterMBean(objectName);
} catch (InstanceNotFoundException e) {
LOG.debug("Unable to unregister counter", e);
} catch (MBeanRegistrationException e) {
LOG.warn("Unable to unregister counter", e);
}
}
@Override
public void onHistogramAdded(String name, Histogram histogram) {
try {
if (filter.matches(name, histogram)) {
final ObjectName objectName = createName("histograms", name);
registerMBean(new JmxHistogram(histogram, objectName, tag), objectName);
}
} catch (InstanceAlreadyExistsException e) {
LOG.debug("Unable to register histogram", e);
} catch (JMException e) {
LOG.warn("Unable to register histogram", e);
}
}
@Override
public void onHistogramRemoved(String name) {
try {
final ObjectName objectName = createName("histograms", name);
unregisterMBean(objectName);
} catch (InstanceNotFoundException e) {
LOG.debug("Unable to unregister histogram", e);
} catch (MBeanRegistrationException e) {
LOG.warn("Unable to unregister histogram", e);
}
}
@Override
public void onMeterAdded(String name, Meter meter) {
try {
if (filter.matches(name, meter)) {
final ObjectName objectName = createName("meters", name);
registerMBean(new JmxMeter(meter, objectName, rateUnit, tag), objectName);
}
} catch (InstanceAlreadyExistsException e) {
LOG.debug("Unable to register meter", e);
} catch (JMException e) {
LOG.warn("Unable to register meter", e);
}
}
@Override
public void onMeterRemoved(String name) {
try {
final ObjectName objectName = createName("meters", name);
unregisterMBean(objectName);
} catch (InstanceNotFoundException e) {
LOG.debug("Unable to unregister meter", e);
} catch (MBeanRegistrationException e) {
LOG.warn("Unable to unregister meter", e);
}
}
@Override
public void onTimerAdded(String name, Timer timer) {
try {
if (filter.matches(name, timer)) {
final ObjectName objectName = createName("timers", name);
registerMBean(new JmxTimer(timer, objectName, rateUnit, durationUnit, tag), objectName);
}
} catch (InstanceAlreadyExistsException e) {
LOG.debug("Unable to register timer", e);
} catch (JMException e) {
LOG.warn("Unable to register timer", e);
}
}
@Override
public void onTimerRemoved(String name) {
try {
final ObjectName objectName = createName("timers", name);
unregisterMBean(objectName);
} catch (InstanceNotFoundException e) {
LOG.debug("Unable to unregister timer", e);
} catch (MBeanRegistrationException e) {
LOG.warn("Unable to unregister timer", e);
}
}
private ObjectName createName(String type, String name) {
return objectNameFactory.createName(type, this.name, name);
}
void unregisterAll() {
for (ObjectName name : registered.keySet()) {
try {
unregisterMBean(name);
} catch (InstanceNotFoundException e) {
LOG.debug("Unable to unregister metric", e);
} catch (MBeanRegistrationException e) {
LOG.warn("Unable to unregister metric", e);
}
}
registered.clear();
}
}
private final MetricRegistry registry;
private final JmxListener listener;
private JmxMetricsReporter(MBeanServer mBeanServer,
String domain,
MetricRegistry registry,
MetricFilter filter,
TimeUnit rateUnit,
TimeUnit durationUnit,
ObjectNameFactory objectNameFactory,
String tag) {
this.registry = registry;
this.listener = new JmxListener(mBeanServer, domain, filter, rateUnit, durationUnit, objectNameFactory, tag);
}
public void start() {
registry.addListener(listener);
// process existing metrics
Map<String, Metric> metrics = new HashMap<>(registry.getMetrics());
metrics.forEach((k, v) -> {
if (v instanceof Counter) {
listener.onCounterAdded(k, (Counter)v);
} else if (v instanceof Meter) {
listener.onMeterAdded(k, (Meter)v);
} else if (v instanceof Histogram) {
listener.onHistogramAdded(k, (Histogram)v);
} else if (v instanceof Timer) {
listener.onTimerAdded(k, (Timer)v);
} else if (v instanceof Gauge) {
listener.onGaugeAdded(k, (Gauge)v);
} else {
LOG.warn("Unknown metric type " + v.getClass().getName() + " for metric '" + k + "', ignoring");
}
});
}
@Override
public void close() {
registry.removeListener(listener);
listener.unregisterAll();
}
}
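
Note: the tag is enforced at unregistration time with a JMX QueryExp, so only beans carrying this reporter's "_instanceTag" are removed even if another reporter has re-registered the same ObjectName in the meantime. The query, extracted as a sketch (checked JMX exceptions omitted for brevity):

    // Sketch of the tag-scoped unregistration used by JmxListener above.
    QueryExp exp = Query.eq(Query.attr(JmxMetricsReporter.INSTANCE_TAG), Query.value(tag));
    for (ObjectInstance o : mBeanServer.queryMBeans(objectName, exp)) {
      mBeanServer.unregisterMBean(o.getObjectName());   // leaves beans with a different tag alone
    }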

View File

@ -14,7 +14,7 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.solr.metrics.reporters;
package org.apache.solr.metrics.reporters.jmx;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

View File

@ -0,0 +1,21 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
* This package contains components that support {@link org.apache.solr.metrics.reporters.SolrJmxReporter}.
*/
package org.apache.solr.metrics.reporters.jmx;

View File

@ -18,10 +18,12 @@ package org.apache.solr.update;
import java.io.IOException;
import java.lang.invoke.MethodHandles;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
@ -77,6 +79,13 @@ public final class DefaultSolrCoreState extends SolrCoreState implements Recover
protected final ReentrantLock commitLock = new ReentrantLock();
private AtomicBoolean cdcrRunning = new AtomicBoolean();
private volatile Future<Boolean> cdcrBootstrapFuture;
private volatile Callable cdcrBootstrapCallable;
@Deprecated
public DefaultSolrCoreState(DirectoryFactory directoryFactory) {
this(directoryFactory, new RecoveryStrategy.Builder());
@ -416,4 +425,34 @@ public final class DefaultSolrCoreState extends SolrCoreState implements Recover
public Lock getRecoveryLock() {
return recoveryLock;
}
@Override
public boolean getCdcrBootstrapRunning() {
return cdcrRunning.get();
}
@Override
public void setCdcrBootstrapRunning(boolean cdcrRunning) {
this.cdcrRunning.set(cdcrRunning);
}
@Override
public Future<Boolean> getCdcrBootstrapFuture() {
return cdcrBootstrapFuture;
}
@Override
public void setCdcrBootstrapFuture(Future<Boolean> cdcrBootstrapFuture) {
this.cdcrBootstrapFuture = cdcrBootstrapFuture;
}
@Override
public Callable getCdcrBootstrapCallable() {
return cdcrBootstrapCallable;
}
@Override
public void setCdcrBootstrapCallable(Callable cdcrBootstrapCallable) {
this.cdcrBootstrapCallable = cdcrBootstrapCallable;
}
}

View File

@ -18,6 +18,8 @@ package org.apache.solr.update;
import java.io.IOException;
import java.lang.invoke.MethodHandles;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.locks.Lock;
import org.apache.lucene.index.IndexWriter;
@ -177,4 +179,19 @@ public abstract class SolrCoreState {
}
public abstract Lock getRecoveryLock();
// These are needed to properly synchronize the bootstrapping when the cores
// in the target DC require a full sync.
public abstract boolean getCdcrBootstrapRunning();
public abstract void setCdcrBootstrapRunning(boolean cdcrRunning);
public abstract Future<Boolean> getCdcrBootstrapFuture();
public abstract void setCdcrBootstrapFuture(Future<Boolean> cdcrBootstrapFuture);
public abstract Callable getCdcrBootstrapCallable();
public abstract void setCdcrBootstrapCallable(Callable cdcrBootstrapCallable);
}

View File

@ -1502,10 +1502,6 @@ public class SolrCLI {
if (cli.hasOption("maxShardsPerNode")) {
maxShardsPerNode = Integer.parseInt(cli.getOptionValue("maxShardsPerNode"));
} else {
// need number of live nodes to determine maxShardsPerNode if it is not set
int numNodes = liveNodes.size();
maxShardsPerNode = ((numShards*replicationFactor)+numNodes-1)/numNodes;
}
String confname = cli.getOptionValue("confname");
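
Note: with SOLR-11239 the CLI stops guessing a per-node limit from the live-node count; per the CHANGES entry, bin/solr now sends maxShardsPerNode=-1 ("no limit") instead. For reference, the dropped default was a ceiling division over live nodes:

    // Removed default, now replaced by sending -1 ("no limit") to the Collections API.
    int numNodes = liveNodes.size();
    maxShardsPerNode = ((numShards * replicationFactor) + numNodes - 1) / numNodes;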

View File

@ -655,36 +655,44 @@ public class AutoScalingHandlerTest extends SolrCloudTestCase {
assertNotNull(violations);
assertEquals(0, violations.size());
// temporarily increase replica limit in cluster policy so that we can create a collection with 6 replicas
setClusterPolicyCommand = "{" +
" 'set-cluster-policy': [" +
" {'cores':'<10', 'node':'#ANY'}," +
" {'replica':'<4', 'shard': '#EACH', 'node': '#ANY'}," +
" {'nodeRole':'overseer', 'replica':0}" +
" ]" +
String setEmptyClusterPolicyCommand = "{" +
" 'set-cluster-policy': []" +
"}";
req = createAutoScalingRequest(SolrRequest.METHOD.POST, setClusterPolicyCommand);
req = createAutoScalingRequest(SolrRequest.METHOD.POST, setEmptyClusterPolicyCommand);
response = solrClient.request(req);
assertEquals(response.get("result").toString(), "success");
req = createAutoScalingRequest(SolrRequest.METHOD.POST, "{set-cluster-policy : []}");
response = solrClient.request(req);
assertEquals(response.get("result").toString(), "success");
// lets create a collection which violates the rule replicas < 2
CollectionAdminResponse adminResponse = CollectionAdminRequest.Create
.createCollection("readApiTestViolations", CONFIGSET_NAME, 1, 6)
.process(solrClient);
try {
CollectionAdminRequest.Create create = CollectionAdminRequest.Create.createCollection("readApiTestViolations", CONFIGSET_NAME, 1, 6);
create.setMaxShardsPerNode(10);
create.process(solrClient);
fail();
} catch (Exception e) {
assertTrue(e.getMessage().contains("'maxShardsPerNode>0' is not supported when autoScaling policies are used"));
}
// lets create a collection which violates the rule replicas < 2
CollectionAdminRequest.Create create = CollectionAdminRequest.Create.createCollection("readApiTestViolations", CONFIGSET_NAME, 1, 6);
CollectionAdminResponse adminResponse = create.process(solrClient);
assertTrue(adminResponse.isSuccess());
// reset to original cluster policy which allows only 1 replica per shard per node
setClusterPolicyCommand = "{" +
" 'set-cluster-policy': [" +
" {'cores':'<10', 'node':'#ANY'}," +
" {'replica':'<2', 'shard': '#EACH', 'node': '#ANY'}," +
" {'nodeRole':'overseer', 'replica':0}" +
" ]" +
"}";
// reset the original cluster policy
req = createAutoScalingRequest(SolrRequest.METHOD.POST, setClusterPolicyCommand);
response = solrClient.request(req);
assertEquals(response.get("result").toString(), "success");
req = createAutoScalingRequest(SolrRequest.METHOD.POST, setClusterPolicyCommand);
response = solrClient.request(req);
assertEquals(response.get("result").toString(), "success");
// get the diagnostics output again
req = createAutoScalingRequest(SolrRequest.METHOD.GET, "/diagnostics", null);
response = solrClient.request(req);

View File

@ -23,6 +23,7 @@ import java.util.Map;
import org.apache.lucene.util.LuceneTestCase;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.embedded.JettySolrRunner;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
@ -40,6 +41,7 @@ import org.slf4j.LoggerFactory;
import static org.apache.solr.client.solrj.SolrRequest.METHOD.GET;
import static org.apache.solr.client.solrj.SolrRequest.METHOD.POST;
import static org.apache.solr.cloud.autoscaling.AutoScalingHandlerTest.createAutoScalingRequest;
import static org.apache.solr.common.params.CommonParams.COLLECTIONS_HANDLER_PATH;
import static org.junit.matchers.JUnitMatchers.containsString;
@ -91,6 +93,41 @@ public class RulesTest extends SolrCloudTestCase {
}
@Test
public void testPortRuleInPresenceOfClusterPolicy() throws Exception {
JettySolrRunner jetty = cluster.getRandomJetty(random());
String port = Integer.toString(jetty.getLocalPort());
// this cluster policy prohibits having any replicas on a node with the above port
String setClusterPolicyCommand = "{" +
" 'set-cluster-policy': [" +
" {'replica': 0, 'port':'" + port + "'}" +
" ]" +
"}";
SolrRequest req = createAutoScalingRequest(SolrRequest.METHOD.POST, setClusterPolicyCommand);
cluster.getSolrClient().request(req);
// but this collection is created with a replica placement rule that says all replicas must be created
// on a node with above port (in direct conflict with the cluster policy)
String rulesColl = "portRuleColl2";
CollectionAdminRequest.createCollectionWithImplicitRouter(rulesColl, "conf", "shard1", 2)
.setRule("port:" + port)
.setSnitch("class:ImplicitSnitch")
.process(cluster.getSolrClient());
// now we assert that the replica placement rule is used instead of the cluster policy
DocCollection rulesCollection = getCollectionState(rulesColl);
List list = (List) rulesCollection.get("rule");
assertEquals(1, list.size());
assertEquals(port, ((Map) list.get(0)).get("port"));
list = (List) rulesCollection.get("snitch");
assertEquals(1, list.size());
assertEquals ( "ImplicitSnitch", ((Map)list.get(0)).get("class"));
boolean allOnExpectedPort = rulesCollection.getReplicas().stream().allMatch(replica -> replica.getNodeName().contains(port));
assertTrue("Not all replicas were found to be on port: " + port + ". Collection state is: " + rulesCollection, allOnExpectedPort);
}
@Test
public void testPortRule() throws Exception {

View File

@ -18,7 +18,7 @@ package org.apache.solr.core;
import org.apache.solr.metrics.SolrMetricManager;
import org.apache.solr.metrics.SolrMetricReporter;
import org.apache.solr.metrics.reporters.JmxObjectNameFactory;
import org.apache.solr.metrics.reporters.jmx.JmxObjectNameFactory;
import org.apache.solr.metrics.reporters.SolrJmxReporter;
import org.apache.solr.util.AbstractSolrTestCase;
import org.junit.AfterClass;

View File

@ -0,0 +1,113 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.solr.metrics.reporters;
import javax.management.MBeanServer;
import javax.management.ObjectInstance;
import javax.management.Query;
import javax.management.QueryExp;
import java.lang.invoke.MethodHandles;
import java.lang.management.ManagementFactory;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.solr.client.solrj.embedded.JettySolrRunner;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.cloud.SolrCloudTestCase;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.metrics.SolrMetricManager;
import org.apache.solr.metrics.SolrMetricReporter;
import org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter;
import org.junit.BeforeClass;
import org.junit.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
*
*/
public class SolrJmxReporterCloudTest extends SolrCloudTestCase {
private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
private static MBeanServer mBeanServer;
private static String COLLECTION = SolrJmxReporterCloudTest.class.getSimpleName() + "_collection";
@BeforeClass
public static void setupCluster() throws Exception {
// make sure there's an MBeanServer
mBeanServer = ManagementFactory.getPlatformMBeanServer();
configureCluster(1)
.addConfig("conf", configset("cloud-minimal"))
.configure();
CollectionAdminRequest.createCollection(COLLECTION, "conf", 2, 1)
.setMaxShardsPerNode(2)
.process(cluster.getSolrClient());
}
@Test
public void testJmxReporter() throws Exception {
CollectionAdminRequest.reloadCollection(COLLECTION).processAndWait(cluster.getSolrClient(), 60);
CloudSolrClient solrClient = cluster.getSolrClient();
// index some docs
for (int i = 0; i < 100; i++) {
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "id-" + i);
solrClient.add(COLLECTION, doc);
}
solrClient.commit(COLLECTION);
// make sure searcher is present
solrClient.query(COLLECTION, params(CommonParams.Q, "*:*"));
for (JettySolrRunner runner : cluster.getJettySolrRunners()) {
SolrMetricManager manager = runner.getCoreContainer().getMetricManager();
for (String registry : manager.registryNames()) {
Map<String, SolrMetricReporter> reporters = manager.getReporters(registry);
long jmxReporters = reporters.entrySet().stream().filter(e -> e.getValue() instanceof SolrJmxReporter).count();
reporters.forEach((k, v) -> {
if (!(v instanceof SolrJmxReporter)) {
return;
}
if (!((SolrJmxReporter)v).getDomain().startsWith("solr.core")) {
return;
}
if (!((SolrJmxReporter)v).isActive()) {
return;
}
QueryExp exp = Query.eq(Query.attr(JmxMetricsReporter.INSTANCE_TAG), Query.value(Integer.toHexString(v.hashCode())));
Set<ObjectInstance> beans = mBeanServer.queryMBeans(null, exp);
if (((SolrJmxReporter) v).isStarted() && beans.isEmpty() && jmxReporters < 2) {
log.info("DocCollection: " + getCollectionState(COLLECTION));
fail("JMX reporter " + k + " for registry " + registry + " failed to register any beans!");
} else {
Set<String> categories = new HashSet<>();
beans.forEach(bean -> {
String cat = bean.getObjectName().getKeyProperty("category");
if (cat != null) {
categories.add(cat);
}
});
log.info("Registered categories: " + categories);
assertTrue("Too few categories: " + categories, categories.size() > 5);
}
});
}
}
}
}

View File

@ -16,6 +16,7 @@
*/
package org.apache.solr.metrics.reporters;
import javax.management.InstanceNotFoundException;
import javax.management.MBeanServer;
import javax.management.ObjectInstance;
import javax.management.ObjectName;
@ -174,6 +175,41 @@ public class SolrJmxReporterTest extends SolrTestCaseJ4 {
rootName.equals(o.getObjectName().getDomain())).count());
}
private static volatile boolean stopped = false;
@Test
public void testClosedCore() throws Exception {
Set<ObjectInstance> objects = mBeanServer.queryMBeans(new ObjectName("*:category=CORE,name=indexDir,*"), null);
assertEquals("Unexpected number of indexDir beans: " + objects.toString(), 1, objects.size());
final ObjectInstance inst = objects.iterator().next();
stopped = false;
try {
Thread t = new Thread() {
public void run() {
while (!stopped) {
try {
Object value = mBeanServer.getAttribute(inst.getObjectName(), "Value");
assertNotNull(value);
} catch (InstanceNotFoundException x) {
// no longer present
break;
} catch (Exception e) {
fail("Unexpected error retrieving attribute: " + e.toString());
}
}
}
};
t.start();
Thread.sleep(500);
h.getCoreContainer().unload(h.getCore().getName());
Thread.sleep(2000);
objects = mBeanServer.queryMBeans(new ObjectName("*:category=CORE,name=indexDir,*"), null);
assertEquals("Unexpected number of beans after core closed: " + objects, 0, objects.size());
} finally {
stopped = true;
}
}
@Test
public void testEnabled() throws Exception {
String root1 = PREFIX + TestUtil.randomSimpleString(random(), 5, 10);

View File

@ -28,7 +28,7 @@ import org.apache.solr.core.SolrCore;
import org.apache.solr.metrics.AggregateMetric;
import org.apache.solr.metrics.SolrMetricManager;
import org.apache.solr.metrics.SolrMetricReporter;
import org.apache.solr.util.JmxUtil;
import org.apache.solr.metrics.reporters.SolrJmxReporter;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;
@ -39,13 +39,13 @@ import org.junit.Test;
public class SolrCloudReportersTest extends SolrCloudTestCase {
int leaderRegistries;
int clusterRegistries;
static int jmxReporter;
int jmxReporter;
@BeforeClass
public static void configureDummyCluster() throws Exception {
configureCluster(0).configure();
jmxReporter = JmxUtil.findFirstMBeanServer() != null ? 1 : 0;
}
@Before
@ -99,6 +99,12 @@ public class SolrCloudReportersTest extends SolrCloudTestCase {
assertEquals(5, sor.getPeriod());
for (String registryName : metricManager.registryNames(".*\\.shard[0-9]\\.replica.*")) {
reporters = metricManager.getReporters(registryName);
jmxReporter = 0;
reporters.forEach((k, v) -> {
if (v instanceof SolrJmxReporter) {
jmxReporter++;
}
});
assertEquals(reporters.toString(), 1 + jmxReporter, reporters.size());
reporter = null;
for (String name : reporters.keySet()) {
@ -158,6 +164,12 @@ public class SolrCloudReportersTest extends SolrCloudTestCase {
assertEquals(reporters.toString(), 0, reporters.size());
for (String registryName : metricManager.registryNames(".*\\.shard[0-9]\\.replica.*")) {
reporters = metricManager.getReporters(registryName);
jmxReporter = 0;
reporters.forEach((k, v) -> {
if (v instanceof SolrJmxReporter) {
jmxReporter++;
}
});
assertEquals(reporters.toString(), 0 + jmxReporter, reporters.size());
}
});

View File

@ -40,6 +40,7 @@ import org.apache.commons.exec.ExecuteResultHandler;
import org.apache.lucene.util.LuceneTestCase;
import org.apache.solr.SolrTestCaseJ4;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.embedded.JettyConfig;
import org.apache.solr.client.solrj.embedded.JettySolrRunner;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
@ -47,6 +48,7 @@ import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.cloud.MiniSolrCloudCluster;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.util.NamedList;
import org.junit.After;
import org.junit.AfterClass;
import org.junit.BeforeClass;
@ -54,6 +56,8 @@ import org.junit.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import static org.apache.solr.cloud.autoscaling.AutoScalingHandlerTest.createAutoScalingRequest;
/**
* Tests the SolrCLI.RunExampleTool implementation that supports bin/solr -e [example]
*/
@ -479,6 +483,115 @@ public class TestSolrCLIRunExample extends SolrTestCaseJ4 {
}
}
File node1SolrHome = new File(solrExampleDir, "cloud/node1/solr");
if (!node1SolrHome.isDirectory()) {
fail(node1SolrHome.getAbsolutePath() + " not found! run cloud example failed; tool output: " + toolOutput);
}
// delete the collection
SolrCLI.DeleteTool deleteTool = new SolrCLI.DeleteTool(stdoutSim);
String[] deleteArgs = new String[]{"-name", collectionName, "-solrUrl", solrUrl};
deleteTool.runTool(
SolrCLI.processCommandLineArgs(SolrCLI.joinCommonAndToolOptions(deleteTool.getOptions()), deleteArgs));
// dump all the output written by the SolrCLI commands to stdout
//System.out.println(toolOutput);
// stop the test instance
executor.execute(org.apache.commons.exec.CommandLine.parse("bin/solr stop -p " + bindPort));
}
@Test
public void testInteractiveSolrCloudExampleWithAutoScalingPolicy() throws Exception {
File solrHomeDir = new File(ExternalPaths.SERVER_HOME);
if (!solrHomeDir.isDirectory())
fail(solrHomeDir.getAbsolutePath() + " not found and is required to run this test!");
Path tmpDir = createTempDir();
File solrExampleDir = tmpDir.toFile();
File solrServerDir = solrHomeDir.getParentFile();
String[] toolArgs = new String[]{
"-example", "cloud",
"-serverDir", solrServerDir.getAbsolutePath(),
"-exampleDir", solrExampleDir.getAbsolutePath()
};
int bindPort = -1;
try (ServerSocket socket = new ServerSocket(0)) {
bindPort = socket.getLocalPort();
}
String collectionName = "testCloudExamplePrompt1";
// this test only supports launching one SolrCloud node due to how MiniSolrCloudCluster works
// and the need for setting the host and port system properties ...
String userInput = "1\n" + bindPort + "\n" + collectionName + "\n2\n2\n_default\n";
// simulate user input from stdin
InputStream userInputSim = new ByteArrayInputStream(userInput.getBytes(StandardCharsets.UTF_8));
// capture tool output to stdout
ByteArrayOutputStream baos = new ByteArrayOutputStream();
PrintStream stdoutSim = new PrintStream(baos, true, StandardCharsets.UTF_8.name());
RunExampleExecutor executor = new RunExampleExecutor(stdoutSim);
closeables.add(executor);
SolrCLI.RunExampleTool tool = new SolrCLI.RunExampleTool(executor, userInputSim, stdoutSim);
try {
tool.runTool(SolrCLI.processCommandLineArgs(SolrCLI.joinCommonAndToolOptions(tool.getOptions()), toolArgs));
} catch (Exception e) {
System.err.println("RunExampleTool failed due to: " + e +
"; stdout from tool prior to failure: " + baos.toString(StandardCharsets.UTF_8.name()));
throw e;
}
String toolOutput = baos.toString(StandardCharsets.UTF_8.name());
// verify Solr is running on the expected port and verify the collection exists
String solrUrl = "http://localhost:" + bindPort + "/solr";
String collectionListUrl = solrUrl + "/admin/collections?action=list";
if (!SolrCLI.safeCheckCollectionExists(collectionListUrl, collectionName)) {
fail("After running Solr cloud example, test collection '" + collectionName +
"' not found in Solr at: " + solrUrl + "; tool output: " + toolOutput);
}
// set a cluster policy, then create another collection to verify creation works while the policy is in effect
CloudSolrClient cloudClient = null;
try {
cloudClient = getCloudSolrClient(executor.solrCloudCluster.getZkServer().getZkAddress());
String setClusterPolicyCommand = "{" +
" 'set-cluster-policy': [" +
" {'cores':'<10', 'node':'#ANY'}," +
" {'replica':'<2', 'shard': '#EACH', 'node': '#ANY'}," +
" {'nodeRole':'overseer', 'replica':0}" +
" ]" +
"}";
SolrRequest req = createAutoScalingRequest(SolrRequest.METHOD.POST, setClusterPolicyCommand);
NamedList<Object> response = cloudClient.request(req);
assertEquals(response.get("result").toString(), "success");
SolrCLI.CreateCollectionTool createCollectionTool = new SolrCLI.CreateCollectionTool(stdoutSim);
String[] createArgs = new String[]{"create_collection", "-name", "newColl", "-configsetsDir", "_default", "-solrUrl", solrUrl};
createCollectionTool.runTool(
SolrCLI.processCommandLineArgs(SolrCLI.joinCommonAndToolOptions(createCollectionTool.getOptions()), createArgs));
solrUrl = "http://localhost:" + bindPort + "/solr";
collectionListUrl = solrUrl + "/admin/collections?action=list";
if (!SolrCLI.safeCheckCollectionExists(collectionListUrl, "newColl")) {
toolOutput = baos.toString(StandardCharsets.UTF_8.name());
fail("After running Solr cloud example, test collection 'newColl' not found in Solr at: " + solrUrl + "; tool output: " + toolOutput);
}
} finally {
if (cloudClient != null) {
try {
cloudClient.close();
} catch (Exception ignore) {
}
}
}
File node1SolrHome = new File(solrExampleDir, "cloud/node1/solr");
if (!node1SolrHome.isDirectory()) {
fail(node1SolrHome.getAbsolutePath()+" not found! run cloud example failed; tool output: "+toolOutput);
@ -487,6 +600,10 @@ public class TestSolrCLIRunExample extends SolrTestCaseJ4 {
// delete the collection
SolrCLI.DeleteTool deleteTool = new SolrCLI.DeleteTool(stdoutSim);
String[] deleteArgs = new String[] { "-name", collectionName, "-solrUrl", solrUrl };
deleteTool.runTool(
SolrCLI.processCommandLineArgs(SolrCLI.joinCommonAndToolOptions(deleteTool.getOptions()), deleteArgs));
deleteTool = new SolrCLI.DeleteTool(stdoutSim);
deleteArgs = new String[]{"-name", "newColl", "-solrUrl", solrUrl};
deleteTool.runTool(
SolrCLI.processCommandLineArgs(SolrCLI.joinCommonAndToolOptions(deleteTool.getOptions()), deleteArgs));
@ -496,7 +613,7 @@ public class TestSolrCLIRunExample extends SolrTestCaseJ4 {
// stop the test instance
executor.execute(org.apache.commons.exec.CommandLine.parse("bin/solr stop -p "+bindPort));
}
@Test
public void testFailExecuteScript() throws Exception {
File solrHomeDir = new File(ExternalPaths.SERVER_HOME);

View File

@ -1 +0,0 @@
4ba272cee2e367766dfdc1901c960de352160d41

View File

@ -0,0 +1 @@
0176f1ef8366257e7b6214c3bbd710cf47593135

View File

@ -1 +0,0 @@
f2aae796f4643180b4e4a159dafc4403e6b25ca7

View File

@ -0,0 +1 @@
160c0cefd2fddacd040c41801f40a5a372a9302c

View File

@ -1 +0,0 @@
ea3800883f79f757b2635a737bb71bb21e90cf19

View File

@ -0,0 +1 @@
32f5fe22ed468a49df1ffcbb27c39c1b53f261aa

View File

@ -1 +0,0 @@
52d796b58c3a997e59e6b47c4bf022cedcba3514

View File

@ -0,0 +1 @@
5b68e7761fcacefcf26ad9ab50943db65fda2c3d

View File

@ -1 +0,0 @@
d4829a57973c36f117792455024684bb6a5202aa

View File

@ -0,0 +1 @@
4a28dd045b8992752ff7727f25cf9e888e9c8c4c

View File

@ -1 +0,0 @@
823899b9456b3337422e0d98851cfe7842ef2516

View File

@ -0,0 +1 @@
8fb029863ceb6531ee0e24c59a004f622226217b

View File

@ -1 +0,0 @@
68be91fa1bcc82eed1709d36e6a85db7d5aff331

View File

@ -0,0 +1 @@
9e2ded957c05f447a0611fa64ca4ab5f7cc5aa65

View File

@ -1 +0,0 @@
791df6c55ad62841ff518ba6634e905a95567260

View File

@ -0,0 +1 @@
6a1523d44ebb527eed068a5c8bfd22edd6a20530

View File

@ -1 +0,0 @@
b5714a6005387b2a361d5b39a3a37d4df1892e62

View File

@ -0,0 +1 @@
21a698f9d58d03cdf58bf2a40f93de58c2eab138

View File

@ -1 +0,0 @@
6f49da101a1c3cd1ccd78ac38391bbc36619658e

View File

@ -0,0 +1 @@
0bb3b1ddc06525eba71c37f51402996502d323a9

View File

@ -1 +0,0 @@
fbf89f6f3b995992f82ec09104ab9a75d31d281b

View File

@ -0,0 +1 @@
19ce4203809da37f8ea7a5632704fa71b6f0ccc2

View File

@ -1 +0,0 @@
c9ad20bd632ffe1d8e4631f2ed185310db258f48

View File

@ -0,0 +1 @@
5b41166ce279c481216501d45c0d0f4f6da23c0b

View File

@ -1 +0,0 @@
3054375490c577ee6156a4b63ec262a39b36fc7e

View File

@ -0,0 +1 @@
9f3f158a6a4587c4283561a3a3fc5a187173becf

View File

@ -1 +1 @@
122f8028ab12222c9c9b6a7861d9cd3cc5d2ad45
68b040771da53967c7e48f2ffd7c53732687f425

View File

@ -18,7 +18,7 @@
// specific language governing permissions and limitations
// under the License.
Having had some fun with Solr, you will now learn about all the cool things it can do.
Solr is a search server built on top of Apache Lucene, an open source, Java-based, information retrieval library. It is designed to drive powerful document retrieval applications - wherever you need to serve data to users based on their queries, Solr can work for you.
Here is an example of how Solr might be integrated into an application:
@ -30,13 +30,12 @@ In the scenario above, Solr runs along side other server applications. For examp
Solr makes it easy to add the capability to search through the online store through the following steps:
. Define a _schema_. The schema tells Solr about the contents of documents it will be indexing. In the online store example, the schema would define fields for the product name, description, price, manufacturer, and so on. Solr's schema is powerful and flexible and allows you to tailor Solr's behavior to your application. See <<documents-fields-and-schema-design.adoc#documents-fields-and-schema-design,Documents, Fields, and Schema Design>> for all the details.
. Deploy Solr.
. Feed Solr documents for which your users will search.
. Expose search functionality in your application.
Because Solr is based on open standards, it is highly extensible. Solr queries are RESTful, which means, in essence, that a query is a simple HTTP request URL and the response is a structured document: mainly XML, but it could also be JSON, CSV, or some other format. This means that a wide variety of clients will be able to use Solr, from other web applications to browser clients, rich client applications, and mobile devices. Any platform capable of HTTP can talk to Solr. See <<client-apis.adoc#client-apis,Client APIs>> for details on client APIs.
Because Solr is based on open standards, it is highly extensible. Solr queries are simple HTTP request URLs and the response is a structured document: mainly JSON, but it could also be XML, CSV, or other formats. This means that a wide variety of clients will be able to use Solr, from other web applications to browser clients, rich client applications, and mobile devices. Any platform capable of HTTP can talk to Solr. See <<client-apis.adoc#client-apis,Client APIs>> for details on client APIs.
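As a minimal illustration of the point above (not part of the original guide), the sketch below sends one such HTTP query from Java and prints the JSON response; the host, port, and collection name `techproducts` are assumptions.

[source,java]
----
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// A Solr query is just an HTTP request URL; the response here is requested as JSON.
public class SimpleSolrQuery {
  public static void main(String[] args) throws Exception {
    String q = URLEncoder.encode("*:*", StandardCharsets.UTF_8.name());
    URL url = new URL("http://localhost:8983/solr/techproducts/select?q=" + q + "&wt=json");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // the structured (JSON) response body
      }
    }
  }
}
----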
Solr is based on the Apache Lucene project, a high-performance, full-featured search engine. Solr offers support for the simplest keyword searching through to complex queries on multiple fields and faceted search results. <<searching.adoc#searching,Searching>> has more information about searching and queries.
Solr offers support for the simplest keyword searching through to complex queries on multiple fields and faceted search results. <<searching.adoc#searching,Searching>> has more information about searching and queries.
If Solr's capabilities are not impressive enough, its ability to handle very high-volume applications should do the trick.
@ -44,6 +43,4 @@ A relatively common scenario is that you have so much data, or so many queries,
For example: "Sharding" is a scaling technique in which a collection is split into multiple logical pieces called "shards" in order to scale up the number of documents in a collection beyond what could physically fit on a single server. Incoming queries are distributed to every shard in the collection, which respond with merged results. Another technique available is to increase the "Replication Factor" of your collection, which allows you to add servers with additional copies of your collection to handle higher concurrent query load by spreading the requests around to multiple machines. Sharding and Replication are not mutually exclusive, and together make Solr an extremely powerful and scalable platform.
Best of all, this talk about high-volume applications is not just hypothetical: some of the famous Internet sites that use Solr today are Macy's, EBay, and Zappo's.
For more information, take a look at https://wiki.apache.org/solr/PublicServers.
Best of all, this talk about high-volume applications is not just hypothetical: some of the famous Internet sites that use Solr today are Macy's, EBay, and Zappo's. For more examples, take a look at https://wiki.apache.org/solr/PublicServers.

View File

@ -18,29 +18,29 @@
// specific language governing permissions and limitations
// under the License.
Cross Data Center Replication (CDCR) allows you to create multiple SolrCloud data centers and keep them in sync in case they are needed at a future time.
Cross Data Center Replication (CDCR) allows you to create multiple SolrCloud data centers and keep them in sync.
The <<solrcloud.adoc#solrcloud,SolrCloud>> architecture is not particularly well suited for situations where a single SolrCloud cluster consists of nodes in separated data clusters connected by an expensive pipe. The root problem is that SolrCloud is designed to support <<near-real-time-searching.adoc#near-real-time-searching,Near Real Time Searching>> by immediately forwarding updates between nodes in the cluster on a per-shard basis. "CDCR" features exist to help mitigate the risk of an entire data center outage.
The <<solrcloud.adoc#solrcloud,SolrCloud>> architecture is designed to support <<near-real-time-searching.adoc#near-real-time-searching,Near Real Time Searching>> (NRT) searches on a Solr collection usually consisting of multiple nodes in a single data center. "CDCR" augments this model by forwarding updates from a Solr collection in one data center to a parallel Solr collection in another data center where the network latencies are greater than the SolrCloud model was designed to accommodate.
== What is CDCR?
CDCR supports replicating data from one data center to multiple data centers. The initial version of the solution supports an active-passive scenario where data updates are replicated from a Source data center to one or more target data centers.
CDCR supports replicating data from one data center to multiple data centers. The initial version of the solution supports a uni-directional scenario where data updates are replicated from a Source data center to one or more Target data centers.
The target data center(s) will not propagate updates such as adds, updates, or deletes to the source data center and updates should _not_ be sent to any of the target data center(s).
The Target data center(s) will not propagate updates such as adds, updates, or deletes to the Source data center and updates should _not_ be sent to any of the Target data center(s).
Source and target data centers can serve search queries when CDCR is operating. The target data centers will have slightly stale views of the corpus due to propagation delays, but this is minimal (perhaps a few seconds).
Source and Target data centers can serve search queries when CDCR is operating. The Target data centers will lag somewhat behind the Source cluster due to propagation delays.
Data changes on the source data center are replicated to the target data center only after they are persisted to disk. The data changes can be replicated in near real-time (with a small delay) or could be scheduled to be sent in intervals to the target data center. This solution pre-supposes that the source and target data centers begin with the same documents indexed. Of course the indexes may be empty to start.
Data changes on the Source data center are replicated to the Target data center only after they are persisted to disk. The data changes can be replicated in near real-time (with a small delay) or could be scheduled to be sent at longer intervals to the Target data center. CDCR can "bootstrap" the collection to the Target data center. Since this is a full copy of the entire index, network bandwidth should be considered. Of course both Source and Target collections may be empty to start.
Each shard leader in the source data center will be responsible for replicating its updates to the corresponding leader in the target data center. When receiving updates from the source data center, shard leaders in the target data center will replicate the changes to their own replicas.
Each shard leader in the Source data center will be responsible for replicating its updates to the corresponding leader in the Target data center. When receiving updates from the Source data center, shard leaders in the Target data center will replicate the changes to their own replicas as normal SolrCloud updates.
This replication model is designed to tolerate some degradation in connectivity, accommodate limited bandwidth, and support batch updates to optimize communication.
Replication supports both a new empty index and pre-built indexes. In the scenario where the replication is set up on a pre-built index, CDCR will ensure consistency of the replication of the updates, but cannot ensure consistency on the full index. Therefore any index created before CDCR was set up will have to be replicated by other means (described in the section <<Initial Startup>>) so source and target indexes are fully consistent.
Replication supports both a new empty index and pre-built indexes. In the scenario where the replication is set up on a pre-built index in the Source cluster and nothing on the Target cluster, CDCR will replicate the _entire_ index from the Source to Target. This functionality was added in Solr 6.2.
The active-passive nature of the initial implementation implies a "push" model from the source collection to the target collection. Therefore, the source configuration must be able to "see" the ZooKeeper ensemble in the target cluster. The ZooKeeper ensemble is provided configured in the Source's `solrconfig.xml` file.
The uni-directional nature of the initial implementation implies a "push" model from the Source collection to the Target collection. Therefore, the Source configuration must be able to "see" the ZooKeeper ensemble in the Target cluster. The ZooKeeper ensemble address is configured in the Source's `solrconfig.xml` file.
CDCR is configured to replicate from collections in the source cluster to collections in the target cluster on a collection-by-collection basis. Since CDCR is configured in `solrconfig.xml` (on both source and target clusters), the settings can be tailored for the needs of each collection.
CDCR is configured to replicate from collections in the Source cluster to collections in the Target cluster on a collection-by-collection basis. Since CDCR is configured in `solrconfig.xml` (on both Source and Target clusters), the settings can be tailored for the needs of each collection.
CDCR can be configured to replicate from one collection to a second collection _within the same cluster_. That is a specialized scenario not covered in this document.
@ -51,14 +51,15 @@ Terms used in this document include:
[glossary]
Node:: A JVM instance running Solr; a server.
Cluster:: A set of Solr nodes managed as a single unit by a ZooKeeper ensemble, hosting one or more Collections.
Cluster:: A set of Solr nodes managed as a single unit by a ZooKeeper ensemble hosting one or more Collections.
Data Center:: A group of networked servers hosting a Solr cluster. In this document, the terms _Cluster_ and _Data Center_ are interchangeable as we assume that each Solr cluster is hosted in a different group of networked servers.
Shard:: A sub-index of a single logical collection. This may be spread across multiple nodes of the cluster. Each shard can have as many replicas as needed.
Leader:: Each shard has one node identified as its leader. All the writes for documents belonging to a shard are routed through the leader.
Shard:: A sub-index of a single logical collection. This may be spread across multiple nodes of the cluster. Each shard can have 1-N replicas.
Leader:: Each shard has one replica identified as its leader. All the writes for documents belonging to a shard are routed through the leader.
Replica:: A copy of a shard for use in failover or load balancing. Replicas comprising a shard can either be leaders or non-leaders.
Follower:: A convenience term for a replica that is _not_ the leader of a shard.
Collection:: Multiple documents that make up one logical index. A cluster can have multiple collections.
Updates Log:: An append-only log of write operations maintained by each node.
Collection:: A logical index, consisting of one or more shards. A cluster can have multiple collections.
Update:: An operation that changes the collection's index in any way. This could be adding a new document, deleting documents or changing a document.
Update Log(s):: An append-only log of write operations maintained by each node.
== CDCR Architecture
@ -69,20 +70,20 @@ image::images/cross-data-center-replication-cdcr-/CDCR_arch.png[image,width=700,
Updates and deletes are first written to the Source cluster, then forwarded to the Target cluster. The data flow sequence is:
. A shard leader receives a new data update that is processed by its update processor chain.
. A shard leader receives a new update that is processed by its update processor chain.
. The data update is first applied to the local index.
. Upon successful application of the data update on the local index, the data update is added to the Updates Log queue.
. Upon successful application of the data update on the local index, the data update is added to the Update Logs queue.
. After the data update is persisted to disk, the data update is sent to the replicas within the data center.
. After Step 4 is successful, CDCR reads the data update from the Updates Log and pushes it to the corresponding collection in the target data center. This is necessary in order to ensure consistency between the Source and target data centers.
. The leader on the target data center writes the data locally and forwards it to all its followers.
. After Step 4 is successful, CDCR reads the data update from the Update Logs and pushes it to the corresponding collection in the Target data center. This is necessary in order to ensure consistency between the Source and Target data centers.
. The leader on the Target data center writes the data locally and forwards it to all its followers.
Steps 1, 2, 3 and 4 are performed synchronously by SolrCloud; Step 5 is performed asynchronously by a background thread. Given that CDCR replication is performed asynchronously, it becomes possible to push batch updates in order to minimize network communication overhead. Also, if CDCR is unable to push the update at a given time, for example, due to a degradation in connectivity, it can retry later without any impact on the source data center.
Steps 1, 2, 3 and 4 are performed synchronously by SolrCloud; Step 5 is performed asynchronously by a background thread. Given that CDCR replication is performed asynchronously, it becomes possible to push batch updates in order to minimize network communication overhead. Also, if CDCR is unable to push the update at a given time, for example, due to a degradation in connectivity, it can retry later without any impact on the Source data center.
One implication of the architecture is that the leaders in the source cluster must be able to "see" the leaders in the target cluster. Since leaders may change, this effectively means that all nodes in the source cluster must be able to "see" all Solr nodes in the target cluster so firewalls, ACL rules, etc. must be configured with care.
One implication of the architecture is that the leaders in the Source cluster must be able to "see" the leaders in the Target cluster. Since leaders may change in both Source and Target collections, all nodes in the Source cluster must be able to "see" all Solr nodes in the Target cluster, so firewalls, ACL rules, etc., must be configured to allow this.
The current design works most robustly if both the Source and target clusters have the same number of shards. There is no requirement that the shards in the Source and target collection have the same number of replicas.
The current design works most robustly if both the Source and Target clusters have the same number of shards. There is no requirement that the shards in the Source and Target collection have the same number of replicas.
Having different numbers of shards on the Source and target cluster is possible, but is also an "expert" configuration as that option imposes certain constraints and is not recommended. Most of the scenarios where having differing numbers of shards are contemplated are better accomplished by hosting multiple shards on each target Solr instance.
Having different numbers of shards on the Source and Target cluster is possible, but is also an "expert" configuration as that option imposes certain constraints and is not generally recommended. Most of the scenarios in which differing numbers of shards are contemplated are better accomplished by hosting multiple shards on each Solr instance.
== Major Components of CDCR
@ -90,7 +91,7 @@ There are a number of key features and components in CDCRs architecture:
=== CDCR Configuration
In order to configure CDCR, the Source data center requires the host address of the ZooKeeper cluster associated with the target data center. The ZooKeeper host address is the only information needed by CDCR to instantiate the communication with the target Solr cluster. The CDCR configuration file on the source cluster will therefore contain a list of ZooKeeper hosts. The CDCR configuration file might also contain secondary/optional configuration, such as the number of CDC Replicator threads, batch updates related settings, etc.
In order to configure CDCR, the Source data center requires the host address of the ZooKeeper cluster associated with the Target data center. The ZooKeeper host address is the only information needed by CDCR to instantiate the communication with the Target Solr cluster. The CDCR configuration section of the `solrconfig.xml` file on the Source cluster will therefore contain a list of ZooKeeper hosts. The CDCR configuration section of `solrconfig.xml` might also contain secondary/optional configuration, such as the number of CDC Replicator threads, batch updates related settings, etc.
=== CDCR Initialization
@ -98,78 +99,78 @@ CDCR supports incremental updates to either new or existing collections. CDCR ma
* There is an initial bulk load of a corpus followed by lower volume incremental updates. In this case, one can do the initial bulk load and then enable CDCR. See the section <<Initial Startup>> for more information.
* The index is being built up from scratch, without a significant initial bulk load. CDCR can be set up on empty collections and keep them synchronized from the start.
* The index is always being updated at a volume too high for CDCR to keep up. This is especially possible in situations where the connection between the Source and target data centers is poor. This scenario is unsuitable for CDCR in its current form.
* The index is always being updated at a volume too high for CDCR to keep up. This is especially possible in situations where the connection between the Source and Target data centers is poor. This scenario is unsuitable for CDCR in its current form.
=== Inter-Data Center Communication
Communication between data centers will be achieved through HTTP and the Solr REST API using the SolrJ client. The SolrJ client will be instantiated with the ZooKeeper host of the target data center. SolrJ will manage the shard leader discovery process.
The CDCR REST API is the primary form of end-user communication for admin commands. A SolrJ client is used internally for CDCR operations. The SolrJ client gets its configuration information from the `solrconfig.xml` file. Users of CDCR will not interact directly with the internal SolrJ implementation and will interact with CDCR exclusively through the REST API.
=== Updates Tracking & Pushing
CDCR replicates data updates from the source to the target data center by leveraging the Updates Log.
CDCR replicates data updates from the Source to the Target data center by leveraging the Update Logs.
A background thread regularly checks the Updates Log for new entries, and then forwards them to the target data center. The thread therefore needs to keep a checkpoint in the form of a pointer to the last update successfully processed in the Updates Log. Upon acknowledgement from the target data center that updates have been successfully processed, the Updates Log pointer is updated to reflect the current checkpoint.
A background thread regularly checks the Update Logs for new entries, and then forwards them to the Target data center. The thread therefore needs to keep a checkpoint in the form of a pointer to the last update successfully processed in the Update Logs. Upon acknowledgement from the Target data center that updates have been successfully processed, the Update Logs pointer is updated to reflect the current checkpoint.
This pointer must be synchronized across all the replicas. In the case where the leader goes down and a new leader is elected, the new leader will be able to resume replication from the last update by using this synchronized pointer. The strategy to synchronize such a pointer across replicas will be explained next.
If for some reason, the target data center is offline or fails to process the updates, the thread will periodically try to contact the target data center and push the updates.
If, for some reason, the Target data center is offline or fails to process the updates, the thread will periodically try to contact the Target data center and push the updates while buffering updates on the Source cluster. One implication of this is that the Source Update Logs directory should be periodically monitored, as the updates will continue to accumulate and will not be purged until the connection to the Target data center is restored.
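To picture the checkpointing described above, here is a deliberately simplified sketch (illustration only, not Solr's actual CdcrReplicator code); the `UpdateLogReader` and `TargetClient` interfaces and all names are hypothetical. The key point is that the pointer only advances after the Target acknowledges a batch, so a failed push is simply retried later.

[source,java]
----
import java.util.List;

// Illustration only: a minimal checkpoint-driven forwarder, not Solr's actual CDCR code.
public class CheckpointForwarderSketch {

  interface UpdateLogReader {              // hypothetical view over the local Update Logs
    List<String> readAfter(long version, int batchSize);
    long versionOf(String entry);
  }

  interface TargetClient {                 // hypothetical client for the Target data center
    boolean push(List<String> batch);      // true == acknowledged by the Target
  }

  private volatile long lastAckedVersion;  // the checkpoint / pointer

  void forwardOnce(UpdateLogReader log, TargetClient target, int batchSize) {
    List<String> batch = log.readAfter(lastAckedVersion, batchSize);
    if (batch.isEmpty()) {
      return;                              // nothing new in the Update Logs
    }
    if (target.push(batch)) {
      // Advance the checkpoint only after the Target acknowledges the batch;
      // on failure the same entries are re-read and pushed on the next attempt.
      lastAckedVersion = log.versionOf(batch.get(batch.size() - 1));
    }
  }
}
----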
=== Synchronization of Update Checkpoints
A reliable synchronization of the update checkpoints between the shard leader and shard replicas is critical to avoid introducing inconsistency between the Source and target data centers. Another important requirement is that the synchronization must be performed with minimal network traffic to maximize scalability.
A reliable synchronization of the update checkpoints between the shard leader and shard replicas is critical to avoid introducing inconsistency between the Source and Target data centers. Another important requirement is that the synchronization must be performed with minimal network traffic to maximize scalability.
In order to achieve this, the strategy is to:
* Uniquely identify each update operation. This unique identifier will serve as pointer.
* Rely on two storages: an ephemeral storage on the Source shard leader, and a persistent storage on the target cluster.
* Rely on two stores: an ephemeral store on the Source shard leader, and a persistent store on the Target cluster.
The shard leader in the source cluster will be in charge of generating a unique identifier for each update operation, and will keep a copy of the identifier of the last processed updates in memory. The identifier will be sent to the target cluster as part of the update request. On the target data center side, the shard leader will receive the update request, store it along with the unique identifier in the Updates Log, and replicate it to the other shards.
The shard leader in the Source cluster will be in charge of generating a unique identifier for each update operation, and will keep a copy of the identifier of the last processed updates in memory. The identifier will be sent to the Target cluster as part of the update request. On the Target data center side, the shard leader will receive the update request, store it along with the unique identifier in the Update Logs, and replicate it to the other shards.
SolrCloud already provides a unique identifier for each update operation, i.e., a “version” number. This version number is generated using a time-based lmport clock which is incremented for each update operation sent. This provides an “happened-before” ordering of the update operations that will be leveraged in (1) the initialization of the update checkpoint on the source cluster, and in (2) the maintenance strategy of the Updates Log.
SolrCloud already provides a unique identifier for each update operation, i.e., a “version” number. This version number is generated using a time-based Lamport clock which is incremented for each update operation sent. This provides a “happened-before” ordering of the update operations that will be leveraged in (1) the initialization of the update checkpoint on the Source cluster, and in (2) the maintenance strategy of the Update Logs.
The persistent storage on the target cluster is used only during the election of a new shard leader on the Source cluster. If a shard leader goes down on the source cluster and a new leader is elected, the new leader will contact the target cluster to retrieve the last update checkpoint and instantiate its ephemeral pointer. On such a request, the target cluster will retrieve the latest identifier received across all the shards, and send it back to the source cluster. To retrieve the latest identifier, every shard leader will look up the identifier of the first entry in its Update Logs and send it back to a coordinator. The coordinator will have to select the highest among them.
The persistent storage on the Target cluster is used only during the election of a new shard leader on the Source cluster. If a shard leader goes down on the Source cluster and a new leader is elected, the new leader will contact the Target cluster to retrieve the last update checkpoint and instantiate its ephemeral pointer. On such a request, the Target cluster will retrieve the latest identifier received across all the shards, and send it back to the Source cluster. To retrieve the latest identifier, every shard leader will look up the identifier of the first entry in its Update Logs and send it back to a coordinator. The coordinator will have to select the highest among them.
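The "select the highest" step amounts to taking the maximum of the per-shard values; a toy sketch (hypothetical types and names) follows.

[source,java]
----
import java.util.Collections;
import java.util.List;

// Toy illustration: the coordinator picks the highest first-entry version reported
// by the Target shard leaders and returns it as the checkpoint for the new Source leader.
public class CheckpointCoordinatorSketch {
  static long selectCheckpoint(List<Long> firstEntryVersionsPerShard) {
    return Collections.max(firstEntryVersionsPerShard);
  }
}
----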
This strategy does not require any additional network traffic and ensures reliable pointer synchronization. Consistency is principally achieved by leveraging SolrCloud. The update workflow of SolrCloud ensures that every update is applied to the leader but also to any of the replicas. If the leader goes down, a new leader is elected. During the leader election, a synchronization is performed between the new leader and the other replicas. As a result, this ensures that the new leader has a consistent Update Logs with the previous leader. Having a consistent Updates Log means that:
This strategy does not require any additional network traffic and ensures reliable pointer synchronization. Consistency is principally achieved by leveraging SolrCloud. The update workflow of SolrCloud ensures that every update is applied to the leader and also to any of the replicas. If the leader goes down, a new leader is elected. During the leader election, a synchronization is performed between the new leader and the other replicas. This ensures that the new leader has Update Logs consistent with the previous leader's. Having consistent Update Logs means that:
* On the source cluster, the update checkpoint can be reused by the new leader.
* On the target cluster, the update checkpoint will be consistent between the previous and new leader. This ensures the correctness of the update checkpoint sent by a newly elected leader from the target cluster.
* On the Source cluster, the update checkpoint can be reused by the new leader.
* On the Target cluster, the update checkpoint will be consistent between the previous and new leader. This ensures the correctness of the update checkpoint sent by a newly elected leader from the Target cluster.
=== Maintenance of Updates Log
=== Maintenance of Update Logs
The CDCR replication logic requires modification to the maintenance logic of the Updates Log on the source data center. Initially, the Updates Log acts as a fixed size queue, limited to 100 update entries. In the CDCR scenario, the Update Logs must act as a queue of variable size as they need to keep track of all the updates up through the last processed update by the target data center. Entries in the Update Logs are removed only when all pointers (one pointer per target data center) are after them.
The CDCR replication logic requires modification to the maintenance logic of the Update Logs on the Source data center. Initially, the Update Logs act as a fixed-size queue, limited to 100 update entries by default. In the CDCR scenario, the Update Logs must act as a queue of variable size as they need to keep track of all the updates up through the last processed update by the Target data center. Entries in the Update Logs are removed only when all pointers (one pointer per Target data center) are after them.
If the communication with one of the target data center is slow, the Updates Log on the source data center can grow to a substantial size. In such a scenario, it is necessary for the Updates Log to be able to efficiently find a given update operation given its identifier. Given that its identifier is an incremental number, it is possible to implement an efficient search strategy. Each transaction log file contains as part of its filename the version number of the first element. This is used to quickly traverse all the transaction log files and find the transaction log file containing one specific version number.
If the communication with one of the Target data centers is slow, the Update Logs on the Source data center can grow to a substantial size. In such a scenario, it is necessary for the Update Logs to be able to efficiently locate an update operation by its identifier. Given that the identifier is an incremental number, it is possible to implement an efficient search strategy. Each transaction log file contains as part of its filename the version number of the first element. This is used to quickly traverse all the transaction log files and find the transaction log file containing one specific version number.
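To make the filename trick concrete, the sketch below assumes, purely for illustration, transaction log files named `tlog.<sequence>.<firstVersion>` (the real naming may differ) and returns the file whose first version is the largest one not exceeding the version being looked up.

[source,java]
----
import java.io.File;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Optional;

// Illustration only: locate the transaction log file likely to contain a given version
// using nothing but the first-version component encoded in each filename.
public class TlogLookupSketch {
  static Optional<File> findTlogFor(File tlogDir, long wantedVersion) {
    File[] files = tlogDir.listFiles((dir, name) -> name.startsWith("tlog."));
    if (files == null) {
      return Optional.empty();
    }
    return Arrays.stream(files)
        .filter(f -> firstVersionOf(f) <= wantedVersion)
        .max(Comparator.comparingLong(TlogLookupSketch::firstVersionOf));
  }

  private static long firstVersionOf(File f) {
    String[] parts = f.getName().split("\\.");   // assumed form: tlog.<sequence>.<firstVersion>
    return Long.parseLong(parts[parts.length - 1]);
  }
}
----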
=== Monitoring
CDCR provides the following monitoring capabilities over the replication operations:
* Monitoring of the outgoing and incoming replications, with information such as the Source and target nodes, their status, etc.
* Monitoring of the outgoing and incoming replications, with information such as the Source and Target nodes, their status, etc.
* Statistics about the replication, with information such as operations (add/delete) per second, number of documents in the queue, etc.
Information about the lifecycle and statistics will be provided on a per-shard basis by the CDC Replicator thread. The CDCR API can then aggregate this information at a collection level.
=== CDC Replicator
The CDC Replicator is a background thread that is responsible for replicating updates from a Source data center to one or more target data centers. It is responsible in providing monitoring information on a per-shard basis. As there can be a large number of collections and shards in a cluster, we will use a fixed-size pool of CDC Replicator threads that will be shared across shards.
The CDC Replicator is a background thread that is responsible for replicating updates from a Source data center to one or more Target data centers. It is also responsible for providing monitoring information on a per-shard basis. As there can be a large number of collections and shards in a cluster, we will use a fixed-size pool of CDC Replicator threads that will be shared across shards.
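The "fixed-size pool shared across shards" idea can be pictured as a small scheduled executor serving many per-shard forwarding tasks; the sketch below is an illustration only, with hypothetical names, not the actual CDC Replicator implementation.

[source,java]
----
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustration only: a fixed pool, deliberately smaller than the number of shards it serves,
// runs one periodic forwarding task per shard.
public class ReplicatorPoolSketch {
  public static void main(String[] args) {
    List<String> shards = Arrays.asList("shard1", "shard2", "shard3");
    ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);
    for (String shard : shards) {
      pool.scheduleWithFixedDelay(
          () -> System.out.println("forwarding queued updates for " + shard),
          0, 1, TimeUnit.SECONDS);
    }
    // In a real replicator each task would read that shard's Update Logs and push batches
    // to the Target; here the tasks only print, and the pool is left running for demonstration.
  }
}
----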
=== CDCR Limitations
The current design of CDCR has some limitations. CDCR will continue to evolve over time and many of these limitations will be addressed. Among them are:
* CDCR is unlikely to be satisfactory for bulk-load situations where the update rate is high, especially if the bandwidth between the Source and Target clusters is restricted. In this scenario, the initial bulk load should be performed, the Source and Target data centers synchronized and CDCR be utilized for incremental updates.
* CDCR is currently only active-passive; data is pushed from the Source cluster to the Target cluster. There is active work being done in this area in the 6x code line to remove this limitation.
* CDCR is currently only uni-directional; data is pushed from the Source cluster to the Target cluster. There is active work being done in this area to remove this limitation.
* CDCR works most robustly with the same number of shards in the Source and Target collection. The shards in the two collections may have different numbers of replicas.
* Running CDCR with the indexes on HDFS is not currently supported, see the https://issues.apache.org/jira/browse/SOLR-9861[Solr CDCR over HDFS] JIRA issue.
* Configuration files (solrconfig.xml, schema etc.) are not automatically synchronized between the Source and Target clusters. This means that when the Source schema or solrconfig files are changed, those changes must be replicated manually to the Target cluster. This includes adding fields by the <<schema-api.adoc#schema-api,Schema API>> or <<managed-resources.adoc#managed-resources,Managed Resources>> as well as hand editing those files.
* Configuration files (`solrconfig.xml`, the schema, etc.) are not automatically synchronized between the Source and Target clusters. This means that when the Source schema or `solrconfig.xml` files are changed, those changes must be replicated manually to the Target cluster. This includes adding fields by the <<schema-api.adoc#schema-api,Schema API>> or <<managed-resources.adoc#managed-resources,Managed Resources>> as well as hand editing those files.
== CDCR Configuration
The source and target configurations differ in the case of the data centers being in separate clusters. "Cluster" here means separate ZooKeeper ensembles controlling disjoint Solr instances. Whether these data centers are physically separated or not is immaterial for this discussion.
The Source and Target configurations differ in the case of the data centers being in separate clusters. "Cluster" here means separate ZooKeeper ensembles controlling disjoint Solr instances. Whether these data centers are physically separated or not is immaterial for this discussion.
=== Source Configuration
Here is a sample of a source configuration file, a section in `solrconfig.xml`. The presence of the <replica> section causes CDCR to use this cluster as the Source and should not be present in the target collections in the cluster-to-cluster case. Details about each setting are after the two examples:
Here is a sample of a Source configuration file, a section in `solrconfig.xml`. The presence of the `<replica>` section causes CDCR to use this cluster as the Source; this section should not be present in the Target collections. Details about each setting are after the two examples:
[source,xml]
----
@ -207,9 +208,9 @@ Here is a sample of a source configuration file, a section in `solrconfig.xml`.
=== Target Configuration
Here is a typical target configuration.
Here is a typical Target configuration.
Target instance must configure an update processor chain that is specific to CDCR. The update processor chain must include the *CdcrUpdateProcessorFactory*. The task of this processor is to ensure that the version numbers attached to update requests coming from a CDCR source SolrCloud are reused and not overwritten by the target. A properly configured Target configuration looks similar to this.
The Target instance must configure an update processor chain that is specific to CDCR. The update processor chain must include the *CdcrUpdateProcessorFactory*. The task of this processor is to ensure that the version numbers attached to update requests coming from a CDCR Source SolrCloud are reused and not overwritten by the Target. A properly configured Target configuration looks similar to this.
[source,xml]
----
@ -246,20 +247,20 @@ The configuration details, defaults and options are as follows:
==== The Replica Element
CDCR can be configured to forward update requests to one or more replicas. A replica is defined with a “replica” list as follows:
CDCR can be configured to forward update requests to one or more Target collections. A Target collection is defined with a “replica” list as follows:
`zkHost`::
The host address for ZooKeeper of the target SolrCloud. Usually this is a comma-separated list of addresses to each node in the target ZooKeeper ensemble. This parameter is required.
The host address for ZooKeeper of the Target SolrCloud. Usually this is a comma-separated list of addresses to each node in the Target ZooKeeper ensemble. This parameter is required.
`Source`::
The name of the collection on the Source SolrCloud to be replicated. This parameter is required.
`Target`::
The name of the collection on the target SolrCloud to which updates will be forwarded. This parameter is required.
The name of the collection on the Target SolrCloud to which updates will be forwarded. This parameter is required.
==== The Replicator Element
The CDC Replicator is the component in charge of forwarding updates to the replicas. The replicator will monitor the update logs of the Source collection and will forward any new updates to the target collection.
The CDC Replicator is the component in charge of forwarding updates to the replicas. The replicator will monitor the update logs of the Source collection and will forward any new updates to the Target collection.
The replicator uses a fixed thread pool to forward updates to multiple replicas in parallel. If more than one replica is configured, one thread will forward a batch of updates from one replica at a time in a round-robin fashion. The replicator can be configured with a “replicator” list as follows:
@ -277,20 +278,34 @@ The number of updates to send in one batch. The optimal size depends on the size
Expert: Non-leader nodes need to synchronize their update logs with their leader node from time to time in order to clean deprecated transaction log files. By default, such a synchronization process is performed every minute. The schedule of the synchronization can be modified with a “updateLogSynchronizer” list as follows:
`schedule`::
The delay in milliseconds for synchronizing the updates log. The default is `60000`.
The delay in milliseconds for synchronizing the update logs. The default is `60000`.
==== The Buffer Element
CDCR is configured by default to buffer any new incoming updates. When buffering updates, the updates log will store all the updates indefinitely. Replicas do not need to buffer updates, and it is recommended to disable buffer on the target SolrCloud. The buffer can be disabled at startup with a “buffer” list and the parameter “defaultState” as follows:
When buffering updates, the update logs will store all the updates indefinitely. It is recommended to disable buffering on both the Source and Target clusters during normal operation, since the Update Logs will otherwise grow without limit. Leaving buffering enabled is intended for special maintenance periods. The buffer can be disabled at startup with a “buffer” list and the parameter “defaultState” as follows:
`defaultState`::
The state of the buffer at startup. The default is `enabled`.
[TIP]
.Buffering should be enabled only for maintenance windows
====
Buffering is designed to augment maintenance windows. The following points should be kept in mind:
* When buffering is enabled, the Update Logs will grow without limit; they will never be purged.
* During normal operation, the Update Logs will automatically accrue on the Source data center if the Target data center is unavailable; it is not necessary to enable buffering for CDCR to handle routine network disruptions.
** For this reason, monitoring disk usage on the Source data center is recommended as an additional check that the Target data center is receiving updates.
* Buffering should _not_ be enabled on the Target data center as Update Logs would accrue without limit.
* If buffering is enabled then disabled, the Update Logs will be removed when their contents have been sent to the Target data center. This process may take some time.
** Update Log cleanup is not triggered until a new update is sent to the Source data center.
====
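For example, buffering can be switched off through the CDCR API described below using the DISABLEBUFFER action (and re-enabled with ENABLEBUFFER); the SolrJ sketch below is an illustration only, with an assumed base URL and collection name.

[source,java]
----
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

// Sketch: disable CDCR buffering for one collection via the collection-level CDCR endpoint.
public class DisableCdcrBuffer {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      ModifiableSolrParams params = new ModifiableSolrParams();
      params.set("action", "DISABLEBUFFER");
      NamedList<Object> rsp = client.request(
          new GenericSolrRequest(SolrRequest.METHOD.GET, "/techproducts/cdcr", params));
      System.out.println(rsp); // prints the CDCR handler's response
    }
  }
}
----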
== CDCR API
The CDCR API is used to control and monitor the replication process. Control actions are performed at a collection level, i.e., by using the following base URL for API calls: `\http://localhost:8983/solr/<collection>`.
The CDCR API is used to control and monitor the replication process. Control actions are performed at a collection level, i.e., by using the following base URL for API calls: `\http://localhost:8983/solr/<collection>/cdcr`.
Monitor actions are performed at a core level, i.e., by using the following base URL for API calls: `\http://localhost:8983/solr/<collection>`.
Monitor actions are performed at a core level, i.e., by using the following base URL for API calls: `\http://localhost:8983/solr/<core>/cdcr`.
Currently, none of the CDCR API calls have parameters.
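To make the two base URLs concrete, the sketch below issues one control call at the collection level and one monitor call at the core level with plain HTTP GETs; the host, the collection name `cdcr_coll`, and the core name `cdcr_coll_shard1_replica1` are assumptions for illustration.

[source,java]
----
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Sketch: control actions target a collection, monitor actions target an individual core.
public class CdcrApiCalls {
  public static void main(String[] args) throws Exception {
    // Control action against the collection-level base URL.
    System.out.println(get("http://localhost:8983/solr/cdcr_coll/cdcr?action=STATUS"));
    // Monitor action against the core-level base URL.
    System.out.println(get("http://localhost:8983/solr/cdcr_coll_shard1_replica1/cdcr?action=QUEUES"));
  }

  private static String get(String url) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    try (InputStream in = conn.getInputStream();
         Scanner s = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A")) {
      return s.hasNext() ? s.next() : "";
    }
  }
}
----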
@ -482,9 +497,9 @@ The status of CDCR, including the confirmation that CDCR is stopped.
*Output Content*
The output is composed of a list “queues” which contains a list of (ZooKeeper) target hosts, themselves containing a list of target collections. For each collection, the current size of the queue and the timestamp of the last update operation successfully processed is provided. The timestamp of the update operation is the original timestamp, i.e., the time this operation was processed on the Source SolrCloud. This allows an estimate the latency of the replication process.
The output is composed of a list “queues” which contains a list of (ZooKeeper) Target hosts, themselves containing a list of Target collections. For each collection, the current size of the queue and the timestamp of the last update operation successfully processed are provided. The timestamp of the update operation is the original timestamp, i.e., the time this operation was processed on the Source SolrCloud. This allows an estimate of the latency of the replication process.
The “queues” object also contains information about the updates log, such as the size (in bytes) of the updates log on disk (“tlogTotalSize”), the number of transaction log files (“tlogTotalCount”) and the status of the updates log synchronizer (“updateLogSynchronizer”).
The “queues” object also contains information about the update logs, such as the size (in bytes) of the update logs on disk (“tlogTotalSize”), the number of transaction log files (“tlogTotalCount”) and the status of the update logs synchronizer (“updateLogSynchronizer”).
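Tying back to the earlier advice to watch Update Log growth on the Source, the hedged SolrJ sketch below calls QUEUES on a single core and prints the reported `tlogTotalSize` and `tlogTotalCount`; the URL and core name are assumptions, and the field layout follows the description above.

[source,java]
----
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

// Sketch: read the update-log figures reported by the QUEUES action of one core.
public class CdcrQueueSizes {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      ModifiableSolrParams params = new ModifiableSolrParams();
      params.set("action", "QUEUES");
      NamedList<Object> rsp = client.request(
          new GenericSolrRequest(SolrRequest.METHOD.GET, "/cdcr_coll_shard1_replica1/cdcr", params));
      NamedList<?> queues = (NamedList<?>) rsp.get("queues");
      System.out.println("tlogTotalSize  = " + queues.get("tlogTotalSize"));
      System.out.println("tlogTotalCount = " + queues.get("tlogTotalCount"));
    }
  }
}
----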
===== QUEUES Examples
@ -524,7 +539,7 @@ The “queues” object also contains information about the updates log, such as
===== OPS Response
The output is composed of `operationsPerSecond` which contains a list of (ZooKeeper) target hosts, themselves containing a list of target collections. For each collection, the average number of processed operations per second since the start of the replication process is provided. The operations are further broken down into two groups: add and delete operations.
The output is composed of `operationsPerSecond` which contains a list of (ZooKeeper) Target hosts, themselves containing a list of Target collections. For each collection, the average number of processed operations per second since the start of the replication process is provided. The operations are further broken down into two groups: add and delete operations.
===== OPS Examples
@ -562,7 +577,7 @@ The output is composed of `operationsPerSecond` which contains a list of (ZooKee
===== ERRORS Response
The output is composed of a list “errors” which contains a list of (ZooKeeper) target hosts, themselves containing a list of target collections. For each collection, information about errors encountered during the replication is provided, such as the number of consecutive errors encountered by the replicator thread, the number of bad requests or internal errors since the start of the replication process, and a list of the last errors encountered ordered by timestamp.
The output is composed of a list “errors” which contains a list of (ZooKeeper) Target hosts, themselves containing a list of Target collections. For each collection, information about errors encountered during the replication is provided, such as the number of consecutive errors encountered by the replicator thread, the number of bad requests or internal errors since the start of the replication process, and a list of the last errors encountered ordered by timestamp.
===== ERRORS Examples
@ -601,11 +616,18 @@ The output is composed of a list “errors” which contains a list of (ZooKeepe
== Initial Startup
.CDCR Bootstrapping
[TIP]
====
Solr 6.2 added the ability for CDCR to replicate the entire index from the Source to the Target data centers on first-time startup, as an alternative to the procedure described below. For very large indexes, time should be allocated for this initial synchronization if this option is chosen.
====
This is a general approach for initializing CDCR in a production environment based upon an approach taken by the initial working installation of CDCR and generously contributed to illustrate a "real world" scenario.
* Customer uses the CDCR approach to keep a remote disaster-recovery instance available for production backup. This is an active-passive solution.
* Customer uses the CDCR approach to keep a remote disaster-recovery instance available for production backup. This is a uni-directional solution.
* Customer has 26 clouds with 200 million assets per cloud (15GB indexes). Total document count is over 4.8 billion.
** Source and target clouds were synched in 2-3 hour maintenance windows to establish the base index for the targets.
** Source and Target clouds were synched in 2-3 hour maintenance windows to establish the base index for the Targets.
As usual, it is good to start small. Sync a single cloud and monitor for a period of time before doing the others. You may need to adjust your settings several times before finding the right balance.
@ -638,7 +660,7 @@ As usual, it is good to start small. Sync a single cloud and monitor for a perio
----
+
* Upload the modified `solrconfig.xml` to ZooKeeper on both Source and Target
* Sync the index directories from the Source collection to target collection across to the corresponding shard nodes. `rsync` works well for this.
* Sync the index directories from the Source collection to Target collection across to the corresponding shard nodes. `rsync` works well for this.
+
For example, if there are 2 shards on collection1 with 2 replicas for each shard, copy the corresponding index directories from
+
@ -660,7 +682,7 @@ For example, if there are 2 shards on collection1 with 2 replicas for each shard
http://host:port/solr/<collection_name>/cdcr?action=START
+
* There is no need to run the /cdcr?action=START command on the Target
* Disable the buffer on the Target
* Disable the buffer on the Target and Source
+
[source,text]
http://host:port/solr/collection_name/cdcr?action=DISABLEBUFFER
@ -677,7 +699,7 @@ http://host:port/solr/collection_name/cdcr?action=DISABLEBUFFER
== ZooKeeper Settings
With CDCR, the target ZooKeepers will have connections from the Target clouds and the Source clouds. You may need to increase the `maxClientCnxns` setting in `zoo.cfg`.
With CDCR, the Target ZooKeepers will have connections from the Target clouds and the Source clouds. You may need to increase the `maxClientCnxns` setting in `zoo.cfg`.
[source,text]
----

View File

@ -1,7 +1,7 @@
= Getting Started
:page-shortname: getting-started
:page-permalink: getting-started.html
:page-children: installing-solr, running-solr, a-quick-overview, a-step-closer, solr-control-script-reference
:page-children: a-quick-overview, solr-system-requirements, installing-solr, solr-configuration-files, solr-upgrade-notes, taking-solr-to-production, upgrading-a-solr-cluster
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
@ -19,25 +19,21 @@
// specific language governing permissions and limitations
// under the License.
Solr makes it easy for programmers to develop sophisticated, high-performance search applications with advanced features such as faceting (arranging search results in columns with numerical counts of key terms).
[.lead]
Solr makes it easy for programmers to develop sophisticated, high-performance search applications with advanced features.
Solr builds on another open source search technology: Lucene, a Java library that provides indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities. Both Solr and Lucene are managed by the Apache Software Foundation (http://www.apache.org/[www.apache.org)].
The Lucene search library currently ranks among the top 15 open source projects and is one of the top 5 Apache projects, with installations at over 4,000 companies. Lucene/Solr downloads have grown nearly ten times over the past three years, with a current run-rate of over 6,000 downloads a day. The Solr search server, which provides application builders a ready-to-use search platform on top of the Lucene search library, is the fastest growing Lucene sub-project. Apache Lucene/Solr offers an attractive alternative to the proprietary licensed search and discovery software vendors.
This section helps you get Solr up and running quickly, and introduces you to the basic Solr architecture and features. It covers the following topics:
<<installing-solr.adoc#installing-solr,Installing Solr>>: A walkthrough of the Solr installation process.
<<running-solr.adoc#running-solr,Running Solr>>: An introduction to running Solr. Includes information on starting up the servers, adding documents, and running queries.
This section introduces you to the basic Solr architecture and features to help you get up and running quickly. It covers the following topics:
<<a-quick-overview.adoc#a-quick-overview,A Quick Overview>>: A high-level overview of how Solr works.
<<a-step-closer.adoc#a-step-closer,A Step Closer>>: An introduction to Solr's home directory and configuration options.
<<installing-solr.adoc#installing-solr,Installing Solr>>: A walkthrough of the Solr installation process.
<<solr-control-script-reference.adoc#solr-control-script-reference,Solr Control Script Reference>>: a complete reference of all of the commands and options available with the bin/solr script.
<<solr-configuration-files.adoc#solr-configuration-files,Solr Configuration Files>>: Overview of the installation layout and major configuration files.
[TIP]
====
Solr includes a Quick Start tutorial which will be helpful if you are just starting out with Solr. You can find it online at http://lucene.apache.org/solr/quickstart.html.
====
<<solr-upgrade-notes.adoc#solr-upgrade-notes,Solr Upgrade Notes>>: Information about changes made in Solr releases.
<<taking-solr-to-production.adoc#taking-solr-to-production,Taking Solr to Production>>: Detailed steps to help you install Solr as a service and take your application to production.
<<upgrading-a-solr-cluster.adoc#upgrading-a-solr-cluster,Upgrading a Solr Cluster>>: Information for upgrading a production SolrCloud cluster.
TIP: Solr includes a Quick Start tutorial which will be helpful if you are just starting out with Solr. You can find it in this Guide at <<solr-tutorial.adoc#solr-tutorial,Solr Tutorial>>.

Seven binary image files added (not shown in this view); sizes: 120 KiB, 207 KiB, 10 KiB, 250 KiB, 21 KiB, 49 KiB, and 224 KiB.

View File

@ -1,7 +1,7 @@
= Apache Solr Reference Guide
:page-shortname: index
:page-permalink: index.html
:page-children: about-this-guide, getting-started, upgrading-solr, using-the-solr-administration-user-interface, documents-fields-and-schema-design, understanding-analyzers-tokenizers-and-filters, indexing-and-basic-data-operations, searching, the-well-configured-solr-instance, managing-solr, solrcloud, legacy-scaling-and-distribution, client-apis, major-changes-from-solr-5-to-solr-6, upgrading-a-solr-cluster, further-assistance, solr-glossary, errata, how-to-contribute
:page-children: about-this-guide, solr-tutorial, getting-started, solr-control-script-reference, using-the-solr-administration-user-interface, documents-fields-and-schema-design, understanding-analyzers-tokenizers-and-filters, indexing-and-basic-data-operations, searching, the-well-configured-solr-instance, managing-solr, solrcloud, legacy-scaling-and-distribution, client-apis, major-changes-from-solr-5-to-solr-6, further-assistance, solr-glossary, errata, how-to-contribute
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
@ -19,12 +19,17 @@
// specific language governing permissions and limitations
// under the License.
This reference guide describes Apache Solr, the open source solution for search. You can download Apache Solr from the Solr website at http://lucene.apache.org/solr/.
[.lead]
This reference guide describes Apache Solr, the open source solution for search.
This Guide contains the following sections:
Solr builds on Lucene, an open source Java library that provides indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities. Both Solr and Lucene are managed by the Apache Software Foundation (http://www.apache.org/[www.apache.org]). You can download Apache Solr from the Solr website at http://lucene.apache.org/solr/.
This Guide contains the following main sections:
*<<getting-started.adoc#getting-started,Getting Started>>*: This section guides you through the installation and setup of Solr.
*<<solr-control-script-reference.adoc#solr-control-script-reference,Solr Control Script Reference>>*: This section provides information about all of the options available to the `bin/solr` / `bin\solr.cmd` scripts, which can start and stop Solr, configure authentication, and create or remove collections and cores.
*<<using-the-solr-administration-user-interface.adoc#using-the-solr-administration-user-interface,Using the Solr Administration User Interface>>*: This section introduces the Solr Web-based user interface. From your browser you can view configuration files, submit queries, view logfile settings and Java environment settings, and monitor and control distributed configurations.
*<<documents-fields-and-schema-design.adoc#documents-fields-and-schema-design,Documents, Fields, and Schema Design>>*: This section describes how Solr organizes its data for indexing. It explains how a Solr schema defines the fields and field types which Solr uses to organize data within the document files it indexes.

View File

@ -1,6 +1,7 @@
= Installing Solr
:page-shortname: installing-solr
:page-permalink: installing-solr.html
:page-toclevels: 1
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
@ -18,39 +19,178 @@
// specific language governing permissions and limitations
// under the License.
This section describes how to install Solr.
Installation of Solr on Unix-compatible or Windows servers generally requires simply extracting (or, unzipping) the download package.
You can install Solr in any system where a suitable Java Runtime Environment (JRE) is available, as detailed below. Currently this includes Linux, OS X, and Microsoft Windows. The instructions in this section should work for any platform, with a few exceptions for Windows as noted.
Please be sure to review the <<solr-system-requirements.adoc#solr-system-requirements,Solr System Requirements>> before starting Solr.
== Got Java?
== Available Solr Packages
You will need the Java Runtime Environment (JRE) version 1.8 or higher. At a command line, check your Java version like this:
Solr is available from the Solr website. Download the latest release from https://lucene.apache.org/solr/mirrors-solr-latest-redir.html.
[source,plain]
----
$ java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
----
There are three separate packages:
The exact output will vary, but you need to make sure you meet the minimum version requirement. We also recommend choosing a version that is not end-of-life from its vendor. If you don't have the required version, or if the java command is not found, download and install the latest version from Oracle at http://www.oracle.com/technetwork/java/javase/downloads/index.html.
* `solr-{solr-docs-version}.0.tgz` for Linux/Unix/OSX systems
* `solr-{solr-docs-version}.0.zip` for Microsoft Windows systems
* `solr-{solr-docs-version}.0-src.tgz` the Solr source code package. This is useful if you want to develop on Solr without using the official Git repository.
[[install-command]]
== Installing Solr
== Preparing for Installation
Solr is available from the Solr website at http://lucene.apache.org/solr/.
When getting started with Solr, all you need to do is extract the Solr distribution archive to a directory of your choosing. This will suffice as an initial development environment, but take care not to overtax this "toy" installation before setting up your true development and production environments.
For Linux/Unix/OSX systems, download the `.tgz` file. For Microsoft Windows systems, download the `.zip` file.
When you've progressed past initial evaluation of Solr, you'll want to take care to plan your implementation. You may need to reinstall Solr on another server or make a clustered SolrCloud environment.
When getting started, all you need to do is extract the Solr distribution archive to a directory of your choosing. When you're ready to set up Solr for a production environment, please refer to the instructions provided on the <<taking-solr-to-production.adoc#taking-solr-to-production,Taking Solr to Production>> page.
When you're ready to set up Solr for a production environment, please refer to the instructions provided on the <<taking-solr-to-production.adoc#taking-solr-to-production,Taking Solr to Production>> page.
.What Size Server Do I Need?
[NOTE]
====
How to size your Solr installation is a complex question that relies on a number of factors, including the number and structure of documents, how many fields you intend to store, the number of users, etc.
It's highly recommended that you spend a bit of time thinking about the factors that will impact hardware sizing for your Solr implementation. A very good blog post that discusses the issues to consider is https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/[Sizing Hardware in the Abstract: Why We Don't have a Definitive Answer].
====
== Package Installation
To keep things simple for now, extract the Solr distribution archive to your local home directory, for instance on Linux, do:
[source,bash]
[source,bash,subs="attributes"]
----
cd ~/
tar zxf solr-x.y.z.tgz
tar zxf solr-{solr-docs-version}.0.tgz
----
Once extracted, you are now ready to run Solr using the instructions provided in the <<running-solr.adoc#running-solr,Running Solr>> section.
Once extracted, you are now ready to run Solr using the instructions provided in the <<Starting Solr>> section below.
== Directory Layout
After installing Solr, you'll see the following directories and files within them:
bin::
This directory includes several important scripts that will make using Solr easier.
solr and solr.cmd::: This is <<solr-control-script-reference.adoc#solr-control-script-reference,Solr's Control Script>>, also known as `bin/solr` (*nix) / `bin/solr.cmd` (Windows). This script is the preferred tool to start and stop Solr. You can also create collections or cores, configure authentication, and work with configuration files when running in SolrCloud mode.
post::: The <<post-tool.adoc#post-tool,PostTool>>, which provides a simple command line interface for POSTing content to Solr.
solr.in.sh and solr.in.cmd:::
These are property files for *nix and Windows systems, respectively. System-level properties for Java, Jetty, and Solr are configured here. Many of these settings can be overridden when using `bin/solr` / `bin/solr.cmd`, but this allows you to set all the properties in one place (a brief sketch follows this list).
install_solr_services.sh:::
This script is used on *nix systems to install Solr as a service. It is described in more detail in the section <<taking-solr-to-production.adoc#taking-solr-to-production,Taking Solr to Production>>.
contrib::
Solr's `contrib` directory includes add-on plugins for specialized features of Solr.
dist::
The `dist` directory contains the main Solr .jar files.
docs::
The `docs` directory includes a link to online Javadocs for Solr.
example::
The `example` directory includes several types of examples that demonstrate various Solr capabilities. See the section <<Solr Examples>> below for more details on what is in this directory.
licenses::
The `licenses` directory includes all of the licenses for 3rd party libraries used by Solr.
server::
This directory is where the heart of the Solr application resides. A README in this directory provides a detailed overview, but here are some highlights:
* Solr's Admin UI (`server/solr-webapp`)
* Jetty libraries (`server/lib`)
* Log files (`server/logs`) and log configurations (`server/resources`). See the section <<configuring-logging.adoc#configuring-logging,Configuring Logging>> for more details on how to customize Solr's default logging.
* Sample configsets (`server/solr/configsets`)
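As a quick illustration of the `solr.in.sh` / `solr.in.cmd` property files mentioned in the list above, here is a minimal sketch of a few overrides on a *nix system (all values are examples only, not recommendations):

[source,bash]
----
# Example overrides in solr.in.sh; every value here is illustrative
SOLR_PORT=8983        # port Solr listens on
SOLR_HEAP="1g"        # JVM heap size used when starting Solr
SOLR_TIMEZONE="UTC"   # default timezone for the Solr JVM
----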
== Solr Examples
Solr includes a number of example documents and configurations to use when getting started. If you ran through the <<solr-tutorial.adoc#solr-tutorial,Solr Tutorial>>, you have already interacted with some of these files.
Here are the examples included with Solr:
exampledocs::
This is a small set of simple CSV, XML, and JSON files that can be used with `bin/post` when first getting started with Solr. For more information about using `bin/post` with these files, see <<post-tool.adoc#post-tool,Post Tool>>.
example-DIH::
This directory includes a few example DataImport Handler (DIH) configurations to help you get started with importing structured content in a database, an email server, or even an Atom feed. Each example will index a different set of data; see the README there for more details about these examples.
files::
The `files` directory provides a basic search UI for documents such as Word or PDF that you may have stored locally. See the README there for details on how to use this example.
films::
The `films` directory includes a robust set of data about movies in three formats: CSV, XML, and JSON. See the README there for details on how to use this dataset.
== Starting Solr
Solr includes a command line interface tool called `bin/solr` (Linux/MacOS) or `bin\solr.cmd` (Windows). This tool allows you to start and stop Solr, create cores and collections, configure authentication, and check the status of your system.
To use it to start Solr you can simply enter:
[source,bash]
----
bin/solr start
----
If you are running Windows, you can start Solr by running `bin\solr.cmd` instead.
[source,plain]
----
bin\solr.cmd start
----
This will start Solr in the background, listening on port 8983.
When you start Solr in the background, the script will wait to make sure Solr starts correctly before returning to the command line prompt.
TIP: All of the options for the Solr CLI are described in the section <<solr-control-script-reference.adoc#solr-control-script-reference,Solr Control Script Reference>>.
=== Start Solr with a Specific Bundled Example
Solr also provides a number of useful examples to help you learn about key features. You can launch the examples using the `-e` flag. For instance, to launch the "techproducts" example, you would do:
[source,bash]
----
bin/solr -e techproducts
----
Currently, the available examples you can run are: techproducts, dih, schemaless, and cloud. See the section <<solr-control-script-reference.adoc#running-with-example-configurations,Running with Example Configurations>> for details on each example.
.Getting Started with SolrCloud
NOTE: Running the `cloud` example starts Solr in <<solrcloud.adoc#solrcloud,SolrCloud>> mode. For more information on starting Solr in cloud mode, see the section <<getting-started-with-solrcloud.adoc#getting-started-with-solrcloud,Getting Started with SolrCloud>>.
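If you'd rather skip the interactive prompts, the `-noprompt` option accepts all of the defaults; a quick sketch:

[source,bash]
----
# Launch the SolrCloud example, accepting every default without prompting
bin/solr -e cloud -noprompt
----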
=== Check if Solr is Running
If you're not sure if Solr is running locally, you can use the status command:
[source,bash]
----
bin/solr status
----
This will search for running Solr instances on your computer and then gather basic information about them, such as the version and memory usage.
That's it! Solr is running. If you need convincing, use a Web browser to see the Admin Console.
`\http://localhost:8983/solr/`
.The Solr Admin interface.
image::images/running-solr/SolrAdminDashboard.png[image,width=900,height=456]
If Solr is not running, your browser will complain that it cannot connect to the server. Check your port number and try again.
=== Create a Core
If you did not start Solr with an example configuration, you would need to create a core in order to be able to index and search. You can do so by running:
[source,bash]
----
bin/solr create -c <name>
----
This will create a core that uses a data-driven schema which tries to guess the correct field type when you add documents to the index.
To see all available options for creating a new core, execute:
[source,bash]
----
bin/solr create -help
----
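As a short sketch of putting these commands together, you could create a core (the name `films` here is just an example) and index one of the bundled sample files with the `bin/post` tool described above:

[source,bash]
----
# Create a core with the default data-driven configuration
bin/solr create -c films

# Index the bundled films dataset into the new core
bin/post -c films example/films/films.json
----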

View File

@ -18,7 +18,9 @@
// specific language governing permissions and limitations
// under the License.
There are some major changes in Solr 6 to consider before starting to migrate your configurations and indexes. There are many hundreds of changes, so a thorough review of the <<upgrading-solr.adoc#upgrading-solr,Upgrading Solr>> section as well as the {solr-javadocs}/changes/Changes.html[CHANGES.txt] file in your Solr instance will help you plan your migration to Solr 6. This section attempts to highlight some of the major changes you should be aware of.
There are some major changes in Solr 6 to consider before starting to migrate your configurations and indexes.
There are many hundreds of changes, so a thorough review of the <<solr-upgrade-notes.adoc#solr-upgrade-notes,Solr Upgrade Notes>> section as well as the {solr-javadocs}/changes/Changes.html[CHANGES.txt] file in your Solr instance will help you plan your migration to Solr 6. This section attempts to highlight some of the major changes you should be aware of.
== Highlights of New Features in Solr 6

View File

@ -1,7 +1,7 @@
= Managing Solr
:page-shortname: managing-solr
:page-permalink: managing-solr.html
:page-children: taking-solr-to-production, securing-solr, running-solr-on-hdfs, making-and-restoring-backups, configuring-logging, using-jmx-with-solr, mbean-request-handler, performance-statistics-reference, metrics-reporting, v2-api
:page-children: securing-solr, running-solr-on-hdfs, making-and-restoring-backups, configuring-logging, using-jmx-with-solr, mbean-request-handler, performance-statistics-reference, metrics-reporting, v2-api
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
@ -21,8 +21,6 @@
This section describes how to run Solr and how to look at Solr when it is running. It contains the following sections:
<<taking-solr-to-production.adoc#taking-solr-to-production,Taking Solr to Production>>: Describes how to install Solr as a service on Linux for production environments.
<<securing-solr.adoc#securing-solr,Securing Solr>>: How to use the Basic and Kerberos authentication and rule-based authorization plugins for Solr, and how to enable SSL.
<<running-solr-on-hdfs.adoc#running-solr-on-hdfs,Running Solr on HDFS>>: How to use HDFS to store your Solr indexes and transaction logs.

View File

@ -339,6 +339,15 @@ prefix:: The first characters of metric name that will filter the metrics return
property:: Allows requesting only this metric from any compound metric. Multiple `property` parameters can be combined to act as an OR request. For example, to only get the 99th and 999th percentile values from all metric types and groups, you can add `&property=p99_ms&property=p999_ms` to your request. This can be combined with `group`, `type`, and `prefix` as necessary.
key:: The fully-qualified metric name, which specifies one concrete metric instance (the parameter can be
specified multiple times to retrieve multiple concrete metrics). *NOTE: when this parameter is used, the other
selection methods listed above are ignored.* A fully-qualified name consists of the registry name, a colon, and the
metric name, with an optional colon and metric property. Colons in names can be escaped using the back-slash `\`
character. Examples:
`key=solr.node:CONTAINER.fs.totalSpace`
`key=solr.core.collection1:QUERY./select.requestTimes:max_ms`
`key=solr.jvm:system.properties:user.name`
compact:: When false, a more verbose format of the response will be returned. Instead of a response like this:
+
[source,json]
@ -395,3 +404,7 @@ Request only "counter" type metrics in the "core" group, returned in JSON:
Request only "core" group metrics that start with "INDEX", returned in XML:
`\http://localhost:8983/solr/admin/metrics?wt=xml&prefix=INDEX&group=core`
Request only "user.name" property of "system.properties" metric from registry "solr.jvm":
`\http://localhost:8983/solr/admin/metrics?wt=xml&key=solr.jvm:system.properties:user.name`
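A sketch of retrieving two of the concrete metrics shown above in a single request from the command line (multiple `key` parameters are simply combined):

[source,bash]
----
# Fetch two fully-qualified metrics in one call, returned as JSON
curl "http://localhost:8983/solr/admin/metrics?wt=json&key=solr.node:CONTAINER.fs.totalSpace&key=solr.jvm:system.properties:user.name"
----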

View File

@ -1,289 +0,0 @@
= Running Solr
:page-shortname: running-solr
:page-permalink: running-solr.html
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
This section describes how to run Solr with an example schema, how to add documents, and how to run queries.
[[RunningSolr-StarttheServer]]
== Start the Server
If you didn't start Solr after installing it, you can start it by running `bin/solr` from the Solr directory.
[source,bash]
----
bin/solr start
----
If you are running Windows, you can start Solr by running `bin\solr.cmd` instead.
[source,plain]
----
bin\solr.cmd start
----
This will start Solr in the background, listening on port 8983.
When you start Solr in the background, the script will wait to make sure Solr starts correctly before returning to the command line prompt.
The `bin/solr` and `bin\solr.cmd` scripts allow you to customize how you start Solr. Let's work through a few examples of using the `bin/solr` script (if you're running Solr on Windows, the `bin\solr.cmd` works the same as what is shown in the examples below):
[[RunningSolr-SolrScriptOptions]]
=== Solr Script Options
The `bin/solr` script has several options.
[[RunningSolr-ScriptHelp]]
==== Script Help
To see how to use the `bin/solr` script, execute:
[source,bash]
----
bin/solr -help
----
For specific usage instructions for the *start* command, do:
[source,bash]
----
bin/solr start -help
----
[[RunningSolr-StartSolrintheForeground]]
==== Start Solr in the Foreground
Since Solr is a server, it is more common to run it in the background, especially on Unix/Linux. However, to start Solr in the foreground, simply do:
[source,bash]
----
bin/solr start -f
----
If you are running Windows, you can run:
[source,plain]
----
bin\solr.cmd start -f
----
[[RunningSolr-StartSolrwithaDifferentPort]]
==== Start Solr with a Different Port
To change the port Solr listens on, you can use the `-p` parameter when starting, such as:
[source,bash]
----
bin/solr start -p 8984
----
[[RunningSolr-StopSolr]]
==== Stop Solr
When running Solr in the foreground (using `-f`), you can stop it using `Ctrl-c`. However, when running in the background, you should use the *stop* command, such as:
[source,bash]
----
bin/solr stop -p 8983
----
The stop command requires you to specify the port Solr is listening on or you can use the `-all` parameter to stop all running Solr instances.
[[RunningSolr-StartSolrwithaSpecificBundledExample]]
==== Start Solr with a Specific Bundled Example
Solr also provides a number of useful examples to help you learn about key features. You can launch the examples using the `-e` flag. For instance, to launch the "techproducts" example, you would do:
[source,bash]
----
bin/solr -e techproducts
----
Currently, the available examples you can run are: techproducts, dih, schemaless, and cloud. See the section <<solr-control-script-reference.adoc#running-with-example-configurations,Running with Example Configurations>> for details on each example.
.Getting Started with SolrCloud
[NOTE]
====
Running the `cloud` example starts Solr in <<solrcloud.adoc#solrcloud,SolrCloud>> mode. For more information on starting Solr in cloud mode, see the section <<getting-started-with-solrcloud.adoc#getting-started-with-solrcloud,Getting Started with SolrCloud>>.
====
[[RunningSolr-CheckifSolrisRunning]]
==== Check if Solr is Running
If you're not sure if Solr is running locally, you can use the status command:
[source,bash]
----
bin/solr status
----
This will search for running Solr instances on your computer and then gather basic information about them, such as the version and memory usage.
That's it! Solr is running. If you need convincing, use a Web browser to see the Admin Console.
`\http://localhost:8983/solr/`
.The Solr Admin interface.
image::images/running-solr/SolrAdminDashboard.png[image,width=900,height=456]
If Solr is not running, your browser will complain that it cannot connect to the server. Check your port number and try again.
[[RunningSolr-CreateaCore]]
== Create a Core
If you did not start Solr with an example configuration, you would need to create a core in order to be able to index and search. You can do so by running:
[source,bash]
----
bin/solr create -c <name>
----
This will create a core that uses a data-driven schema which tries to guess the correct field type when you add documents to the index.
To see all available options for creating a new core, execute:
[source,bash]
----
bin/solr create -help
----
[[RunningSolr-AddDocuments]]
== Add Documents
Solr is built to find documents that match queries. Solr's schema provides an idea of how content is structured (more on the schema <<documents-fields-and-schema-design.adoc#documents-fields-and-schema-design,later>>), but without documents there is nothing to find. Solr needs input before it can do much.
You may want to add a few sample documents before trying to index your own content. The Solr installation comes with different types of example documents located under the sub-directories of the `example/` directory of your installation.
In the `bin/` directory is the post script, a command line tool which can be used to index different types of documents. Do not worry too much about the details for now. The <<indexing-and-basic-data-operations.adoc#indexing-and-basic-data-operations,Indexing and Basic Data Operations>> section has all the details on indexing.
To see some information about the usage of `bin/post`, use the `-help` option. Windows users, see the section for <<post-tool.adoc#post-tool-windows-support,Post Tool on Windows>>.
`bin/post` can post various types of content to Solr, including files in Solr's native XML and JSON formats, CSV files, a directory tree of rich documents, or even a simple short web crawl. See the examples at the end of `bin/post -help` for various commands to easily get started posting your content into Solr.
Go ahead and add all the documents in some example XML files:
[source,plain]
----
$ bin/post -c gettingstarted example/exampledocs/*.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file gb18030-example.xml (application/xml) to [base]
POSTing file hd.xml (application/xml) to [base]
POSTing file ipod_other.xml (application/xml) to [base]
POSTing file ipod_video.xml (application/xml) to [base]
POSTing file manufacturers.xml (application/xml) to [base]
POSTing file mem.xml (application/xml) to [base]
POSTing file money.xml (application/xml) to [base]
POSTing file monitor.xml (application/xml) to [base]
POSTing file monitor2.xml (application/xml) to [base]
POSTing file mp500.xml (application/xml) to [base]
POSTing file sd500.xml (application/xml) to [base]
POSTing file solr.xml (application/xml) to [base]
POSTing file utf8-example.xml (application/xml) to [base]
POSTing file vidcard.xml (application/xml) to [base]
14 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:00:00.153
----
That's it! Solr has indexed the documents contained in those files.
[[RunningSolr-AskQuestions]]
== Ask Questions
Now that you have indexed documents, you can perform queries. The simplest way is by building a URL that includes the query parameters. This is exactly the same as building any other HTTP URL.
For example, the following query searches all document fields for "video":
`\http://localhost:8983/solr/gettingstarted/select?q=video`
Notice how the URL includes the host name (`localhost`), the port number where the server is listening (`8983`), the application name (`solr`), the request handler for queries (`select`), and finally, the query itself (`q=video`).
The results are contained in an XML document, which you can examine directly by clicking on the link above. The document contains two parts. The first part is the `responseHeader`, which contains information about the response itself. The main part of the reply is in the result tag, which contains one or more doc tags, each of which contains fields from documents that match the query. You can use standard XML transformation techniques to mold Solr's results into a form that is suitable for displaying to users. Alternatively, Solr can output the results in JSON, PHP, Ruby and even user-defined formats.
Just in case you are not running Solr as you read, the following screen shot shows the result of a query (the next example, actually) as viewed in Mozilla Firefox. The top-level response contains a `lst` named `responseHeader` and a result named response. Inside result, you can see the three docs that represent the search results.
.An XML response to a query.
image::images/running-solr/solr34_responseHeader.png[image,width=600,height=634]
Once you have mastered the basic idea of a query, it is easy to add enhancements to explore the query syntax. This one is the same as before but the results only contain the ID, name, and price for each returned document. If you don't specify which fields you want, all of them are returned.
`\http://localhost:8983/solr/gettingstarted/select?q=video&fl=id,name,price`
Here is another example which searches for "black" in the `name` field only. If you do not tell Solr which field to search, it will search default fields, as specified in the schema.
`\http://localhost:8983/solr/gettingstarted/select?q=name:black`
You can provide ranges for fields. The following query finds every document whose price is between $0 and $400.
`\http://localhost:8983/solr/gettingstarted/select?q=price:[0%20TO%20400]&fl=id,name,price`
<<faceting.adoc#faceting,Faceted browsing>> is one of Solr's key features. It allows users to narrow search results in ways that are meaningful to your application. For example, a shopping site could provide facets to narrow search results by manufacturer or price.
Faceting information is returned as a third part of Solr's query response. To get a taste of this power, take a look at the following query. It adds `facet=true` and `facet.field=cat`.
`\http://localhost:8983/solr/gettingstarted/select?q=price:[0%20TO%20400]&fl=id,name,price&facet=true&facet.field=cat`
In addition to the familiar `responseHeader` and response from Solr, a `facet_counts` element is also present. Here is a view with the `responseHeader` and response collapsed so you can see the faceting information clearly.
*An XML Response with faceting*
[source,xml]
----
<response>
<lst name="responseHeader">
...
</lst>
<result name="response" numFound="9" start="0">
<doc>
<str name="id">SOLR1000</str>
<str name="name">Solr, the Enterprise Search Server</str>
<float name="price">0.0</float></doc>
...
</result>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="cat">
<int name="electronics">6</int>
<int name="memory">3</int>
<int name="search">2</int>
<int name="software">2</int>
<int name="camera">1</int>
<int name="copier">1</int>
<int name="multifunction printer">1</int>
<int name="music">1</int>
<int name="printer">1</int>
<int name="scanner">1</int>
<int name="connector">0</int>
<int name="currency">0</int>
<int name="graphics card">0</int>
<int name="hard drive">0</int>
<int name="monitor">0</int>
</lst>
</lst>
<lst name="facet_dates"/>
<lst name="facet_ranges"/>
</lst>
</response>
----
The facet information shows how many of the query results have each possible value of the `cat` field. You could easily use this information to provide users with a quick way to narrow their query results. You can filter results by adding one or more filter queries to the Solr request. This request constrains documents with a category of "software".
`\http://localhost:8983/solr/gettingstarted/select?q=price:0%20TO%20400&fl=id,name,price&facet=true&facet.field=cat&fq=cat:software`

View File

@ -138,7 +138,7 @@ The `<ZOOKEEPER_HOME>/conf/zoo2.cfg` file should have the content:
[source,bash]
----
tickTime=2000
dataDir=c:/sw/zookeeperdata/2
dataDir=/var/lib/zookeeperdata/2
clientPort=2182
initLimit=5
syncLimit=2
@ -152,7 +152,7 @@ You'll also need to create `<ZOOKEEPER_HOME>/conf/zoo3.cfg`:
[source,bash]
----
tickTime=2000
dataDir=c:/sw/zookeeperdata/3
dataDir=/var/lib/zookeeperdata/3
clientPort=2183
initLimit=5
syncLimit=2

View File

@ -1,6 +1,6 @@
= A Step Closer
:page-shortname: a-step-closer
:page-permalink: a-step-closer.html
= Solr Configuration Files
:page-shortname: solr-configuration-files
:page-permalink: solr-configuration-files.html
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
@ -18,9 +18,16 @@
// specific language governing permissions and limitations
// under the License.
You already have some idea of Solr's schema. This section describes Solr's home directory and other configuration options.
Solr has several configuration files that you will interact with during your implementation.
When Solr runs in an application server, it needs access to a home directory. The home directory contains important configuration information and is the place where Solr will store its index. The layout of the home directory will look a little different when you are running Solr in standalone mode vs when you are running in SolrCloud mode.
Many of these files are in XML format, although APIs that interact with configuration settings tend to accept JSON for programmatic access as needed.
== Solr Home
When Solr runs, it needs access to a home directory.
When you first install Solr, your home directory is `server/solr`. However, some examples may change this location (such as, if you run `bin/solr start -e cloud`, your home directory will be `example/cloud`).
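If you want Solr to use a different home directory, the control script's `-s` option sets it at startup; a minimal sketch (the path is a placeholder):

[source,bash]
----
# Start Solr with a custom home directory instead of server/solr
bin/solr start -s /path/to/solr/home
----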
The home directory contains important configuration information and is the place where Solr will store its index. The layout of the home directory will look a little different when you are running Solr in standalone mode vs. when you are running in SolrCloud mode.
The crucial parts of the Solr home directory are shown in these examples:
@ -56,7 +63,10 @@ The crucial parts of the Solr home directory are shown in these examples:
data/
----
You may see other files, but the main ones you need to know are:
You may see other files, but the main ones you need to know are discussed in the next section.
== Configuration Files
Inside Solr's Home, you'll find these files:
* `solr.xml` specifies configuration options for your Solr server instance. For more information on `solr.xml` see <<solr-cores-and-solr-xml.adoc#solr-cores-and-solr-xml,Solr Cores and solr.xml>>.
* Per Solr Core:
@ -67,4 +77,4 @@ You may see other files, but the main ones you need to know are:
Note that the SolrCloud example does not include a `conf` directory for each Solr Core (so there is no `solrconfig.xml` or Schema file). This is because the configuration files usually found in the `conf` directory are stored in ZooKeeper so they can be propagated across the cluster.
If you are using SolrCloud with the embedded ZooKeeper instance, you may also see `zoo.cfg` and `zoo.data` which are ZooKeeper configuration and data files. However, if you are running your own ZooKeeper ensemble, you would supply your own ZooKeeper configuration file when you start it and the copies in Solr would be unused. For more information about ZooKeeper and SolrCloud, see the section <<solrcloud.adoc#solrcloud,SolrCloud>>.
If you are using SolrCloud with the embedded ZooKeeper instance, you may also see `zoo.cfg` and `zoo.data` which are ZooKeeper configuration and data files. However, if you are running your own ZooKeeper ensemble, you would supply your own ZooKeeper configuration file when you start it and the copies in Solr would be unused. For more information about SolrCloud, see the section <<solrcloud.adoc#solrcloud,SolrCloud>>.

View File

@ -25,7 +25,7 @@ You can start and stop Solr, create and delete collections or cores, perform ope
You can find the script in the `bin/` directory of your Solr installation. The `bin/solr` script makes Solr easier to work with by providing simple commands and options to quickly accomplish common goals.
More examples of `bin/solr` in use are available throughout the Solr Reference Guide, but particularly in the sections <<running-solr.adoc#running-solr,Running Solr>> and <<getting-started-with-solrcloud.adoc#getting-started-with-solrcloud,Getting Started with SolrCloud>>.
More examples of `bin/solr` in use are available throughout the Solr Reference Guide, but particularly in the sections <<installing-solr.adoc#starting-solr,Starting Solr>> and <<getting-started-with-solrcloud.adoc#getting-started-with-solrcloud,Getting Started with SolrCloud>>.
== Starting and Stopping

View File

@ -0,0 +1,48 @@
= Solr System Requirements
:page-shortname: solr-system-requirements
:page-permalink: solr-system-requirements.html
:page-toc: false
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
You can install Solr on any system where a suitable Java Runtime Environment (JRE) is available, as detailed below.
Currently this includes Linux, MacOS/OS X, and Microsoft Windows.
== Installation Requirements
=== Java Requirements
You will need the Java Runtime Environment (JRE) version 1.8 or higher. At a command line, check your Java version like this:
[source,bash]
----
$ java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
----
The exact output will vary, but you need to make sure you meet the minimum version requirement. We also recommend choosing a version that is not end-of-life from its vendor. Oracle and OpenJDK are the most heavily tested JREs and are recommended; when possible, use the latest available official release.
Some versions of the Java VM have bugs that may impact your implementation. To be sure, check the page https://wiki.apache.org/lucene-java/JavaBugs[Lucene JavaBugs].
If you don't have the required version, or if the `java` command is not found, download and install the latest version from Oracle at http://www.oracle.com/technetwork/java/javase/downloads/index.html.
=== Supported Operating Systems
Solr is tested on several versions of Linux, MacOS, and Windows.

View File

@ -0,0 +1,980 @@
= Solr Tutorial
:page-shortname: solr-tutorial
:page-permalink: solr-tutorial.html
:page-tocclass: right
:experimental:
This tutorial covers getting Solr up and running, ingesting a variety of data sources into multiple collections,
and getting a feel for the Solr administrative and search interfaces.
It is organized into three sections that each build on the one before it. The <<exercise-1,first exercise>> will ask you to start Solr, create a collection, index some basic documents, and then perform a few searches.
The <<exercise-2,second exercise>> works with a different set of data, and explores requesting facets with the dataset.
The <<exercise-3,third exercise>> encourages you to begin to work with your own data and start a plan for your implementation.
Finally, we'll introduce <<Spatial Queries,spatial search>> and show you how to get your Solr instance back into a clean state.
== Before You Begin
To follow along with this tutorial, you will need...
// TODO possibly remove this system requirements or only replace the link
. To meet the {solr-javadocs}/solr/api/SYSTEM_REQUIREMENTS.html[system requirements]
. An Apache Solr release http://lucene.apache.org/solr/downloads.html[download]. This tutorial is designed for Apache Solr {solr-docs-version}.
For best results, please run the browser showing this tutorial and the Solr server on the same machine so tutorial links will correctly point to your Solr server.
== Unpack Solr
Begin by unzipping the Solr release and changing your working directory to the subdirectory where Solr was installed. For example, with a shell in UNIX, Cygwin, or MacOS:
[source,bash,subs="verbatim,attributes+"]
----
~$ ls solr*
solr-{solr-docs-version}.0.zip
~$ unzip -q solr-{solr-docs-version}.0.zip
~$ cd solr-{solr-docs-version}.0/
----
If you'd like to know more about Solr's directory layout before moving to the first exercise, see the section <<installing-solr.adoc#directory-layout,Directory Layout>> for details.
[[exercise-1]]
== Exercise 1: Index Techproducts Example Data
This exercise will walk you through how to start Solr as a two-node cluster (both nodes on the same machine) and create a collection during startup. Then you will index some sample data that ships with Solr and do some basic searches.
=== Launch Solr in SolrCloud Mode
To launch Solr, run: `bin/solr start -e cloud` on Unix or MacOS; `bin\solr.cmd start -e cloud` on Windows.
This will start an interactive session that will start two Solr "servers" on your machine. This command has an option to run without prompting you for input (`-noprompt`), but we want to modify two of the defaults so we won't use that option now.
[source,subs="verbatim,attributes+"]
----
solr-{solr-docs-version}.0:$ ./bin/solr start -e cloud
Welcome to the SolrCloud example!
This interactive session will help you launch a SolrCloud cluster on your local workstation.
To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes) [2]:
----
The first prompt asks how many nodes we want to run. Note the `[2]` at the end of the last line; that is the default number of nodes. Two is what we want for this example, so you can simply press kbd:[enter].
[source,subs="verbatim,attributes+"]
----
Ok, let's start up 2 Solr nodes for your example SolrCloud cluster.
Please enter the port for node1 [8983]:
----
This will be the port that the first node runs on. Unless you know you have something else running on port 8983 on your machine, accept this default option also by pressing kbd:[enter]. If something is already using that port, you will be asked to choose another port.
[source,subs="verbatim,attributes+"]
----
Please enter the port for node2 [7574]:
----
This is the port the second node will run on. Again, unless you know you have something else running on port 7574 on your machine, accept this default option also by pressing kbd:[enter]. If something is already using that port, you will be asked to choose another port.
Solr will now initialize itself and start running on those two nodes. The script will print the commands it uses for your reference.
[source,subs="verbatim,attributes+"]
----
Starting up 2 Solr nodes for your example SolrCloud cluster.
Creating Solr home directory /solr-{solr-docs-version}.0/example/cloud/node1/solr
Cloning /solr-{solr-docs-version}.0/example/cloud/node1 into
/solr-{solr-docs-version}.0/example/cloud/node2
Starting up Solr on port 8983 using command:
"bin/solr" start -cloud -p 8983 -s "example/cloud/node1/solr"
Waiting up to 180 seconds to see Solr running on port 8983 [\]
Started Solr server on port 8983 (pid=34942). Happy searching!
Starting up Solr on port 7574 using command:
"bin/solr" start -cloud -p 7574 -s "example/cloud/node2/solr" -z localhost:9983
Waiting up to 180 seconds to see Solr running on port 7574 [\]
Started Solr server on port 7574 (pid=35036). Happy searching!
INFO - 2017-07-27 12:28:02.835; org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider; Cluster at localhost:9983 ready
----
Notice that two instances of Solr have started on two nodes. Because we are starting in SolrCloud mode, and did not define any details about an external ZooKeeper cluster, Solr launches its own ZooKeeper and connects both nodes to it.
After startup is complete, you'll be prompted to create a collection to use for indexing data.
[source,subs="verbatim,attributes+"]
----
Now let's create a new collection for indexing documents in your 2-node cluster.
Please provide a name for your new collection: [gettingstarted]
----
Here's the first place where we'll deviate from the default options. This tutorial will ask you to index some sample data included with Solr, called the "techproducts" data. Let's name our collection "techproducts" so it's easy to differentiate from other collections we'll create later. Enter `techproducts` at the prompt and hit kbd:[enter].
[source,subs="verbatim,attributes+"]
----
How many shards would you like to split techproducts into? [2]
----
This is asking how many <<solr-glossary.adoc#shard,shards>> you want to split your index into across the two nodes. Choosing "2" (the default) means we will split the index relatively evenly across both nodes, which is a good way to start. Accept the default by hitting kbd:[enter].
[source,subs="verbatim,attributes+"]
----
How many replicas per shard would you like to create? [2]
----
A replica is a copy of the index that's used for failover (see also the <<solr-glossary.adoc#replica,Solr Glossary definition>>). Again, the default of "2" is fine to start with here also, so accept the default by hitting kbd:[enter].
[source,subs="verbatim,attributes+"]
----
Please choose a configuration for the techproducts collection, available options are:
_default or sample_techproducts_configs [_default]
----
We've reached another point where we will deviate from the default option. Solr has two sample sets of configuration files (called a _configSet_) available out-of-the-box.
A collection must have a configSet, which at a minimum includes the two main configuration files for Solr: the schema file (named either `managed-schema` or `schema.xml`), and `solrconfig.xml`. The question here is which configSet you would like to start with. The `_default` is a bare-bones option, but note there's one whose name includes "techproducts", the same as we named our collection. This configSet is specifically designed to support the sample data we want to use, so enter `sample_techproducts_configs` at the prompt and hit kbd:[enter].
At this point, Solr will create the collection and again output to the screen the commands it issues.
[source,subs="verbatim,attributes+"]
----
Uploading /solr-{solr-docs-version}.0/server/solr/configsets/_default/conf for config techproducts to ZooKeeper at localhost:9983
Connecting to ZooKeeper at localhost:9983 ...
INFO - 2017-07-27 12:48:59.289; org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider; Cluster at localhost:9983 ready
Uploading /solr-{solr-docs-version}.0/server/solr/configsets/sample_techproducts_configs/conf for config techproducts to ZooKeeper at localhost:9983
Creating new collection 'techproducts' using command:
http://localhost:8983/solr/admin/collections?action=CREATE&name=techproducts&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=techproducts
{
"responseHeader":{
"status":0,
"QTime":5460},
"success":{
"192.168.0.110:7574_solr":{
"responseHeader":{
"status":0,
"QTime":4056},
"core":"techproducts_shard1_replica_n1"},
"192.168.0.110:8983_solr":{
"responseHeader":{
"status":0,
"QTime":4056},
"core":"techproducts_shard2_replica_n2"}}}
Enabling auto soft-commits with maxTime 3 secs using the Config API
POSTing request to Config API: http://localhost:8983/solr/techproducts/config
{"set-property":{"updateHandler.autoSoftCommit.maxTime":"3000"}}
Successfully set-property updateHandler.autoSoftCommit.maxTime to 3000
SolrCloud example running, please visit: http://localhost:8983/solr
----
*Congratulations!* Solr is ready for data!
You can see that Solr is running by launching the Solr Admin UI in your web browser: http://localhost:8983/solr/. This is the main starting point for administering Solr.
Solr will now be running two "nodes", one on port 7574 and one on port 8983. There is one collection created automatically, `techproducts`: a two-shard collection, each shard with two replicas.
The http://localhost:8983/solr/#/~cloud[Cloud tab] in the Admin UI diagrams the collection nicely:
.SolrCloud Diagram
image::images/solr-tutorial/tutorial-solrcloud.png[]
=== Index the Techproducts Data
Your Solr server is up and running, but it doesn't contain any data yet, so we can't do any queries.
Solr includes the `bin/post` tool in order to facilitate indexing various types of documents easily. We'll use this tool for the indexing examples below.
You'll need a command shell to run some of the following examples, rooted in the Solr install directory; the shell from where you launched Solr works just fine.
NOTE: Currently the `bin/post` tool does not have a comparable Windows script, but the underlying Java program invoked is available. We'll show examples below for Windows, but you can also see the <<post-tool.adoc#post-tool-windows-support,Windows section>> of the Post Tool documentation for more details.
The data we will index is in the `example/exampledocs` directory. The documents are in a mix of document formats (JSON, CSV, etc.), and fortunately we can index them all at once:
.Linux/Mac
[source,subs="verbatim,attributes+"]
----
solr-{solr-docs-version}.0:$ bin/post -c techproducts example/exampledocs/*
----
.Windows
[source,subs="verbatim,attributes+"]
----
C:\solr-{solr-docs-version}.0> java -jar -Dc=techproducts -Dauto example\exampledocs\post.jar example\exampledocs\*
----
You should see output similar to the following:
[source,subs="verbatim,attributes+"]
----
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/techproducts/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file books.csv (text/csv) to [base]
POSTing file books.json (application/json) to [base]/json/docs
POSTing file gb18030-example.xml (application/xml) to [base]
POSTing file hd.xml (application/xml) to [base]
POSTing file ipod_other.xml (application/xml) to [base]
POSTing file ipod_video.xml (application/xml) to [base]
POSTing file manufacturers.xml (application/xml) to [base]
POSTing file mem.xml (application/xml) to [base]
POSTing file money.xml (application/xml) to [base]
POSTing file monitor.xml (application/xml) to [base]
POSTing file monitor2.xml (application/xml) to [base]
POSTing file more_books.jsonl (application/json) to [base]/json/docs
POSTing file mp500.xml (application/xml) to [base]
POSTing file post.jar (application/octet-stream) to [base]/extract
POSTing file sample.html (text/html) to [base]/extract
POSTing file sd500.xml (application/xml) to [base]
POSTing file solr-word.pdf (application/pdf) to [base]/extract
POSTing file solr.xml (application/xml) to [base]
POSTing file test_utf8.sh (application/octet-stream) to [base]/extract
POSTing file utf8-example.xml (application/xml) to [base]
POSTing file vidcard.xml (application/xml) to [base]
21 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/techproducts/update...
Time spent: 0:00:00.822
----
Congratulations again! You have data in your Solr!
Now we're ready to start searching.
[[tutorial-searching]]
=== Basic Searching
Solr can be queried via REST clients, curl, wget, Chrome POSTMAN, etc., as well as via native clients available for many programming languages.
The Solr Admin UI includes a query builder interface via the Query tab for the `techproducts` collection (at http://localhost:8983/solr/#/techproducts/query). If you click the btn:[Execute Query] button without changing anything in the form, you'll get 10 documents in JSON format:
.Query Screen
image::images/solr-tutorial/tutorial-query-screen.png[Solr Quick Start: techproducts Query screen with results]
The URL sent by the Admin UI to Solr is shown in light grey near the top right of the above screenshot. If you click on it, your browser will show you the raw response.
To use curl, give the same URL shown in your browser in quotes on the command line:
`curl "http://localhost:8983/solr/techproducts/select?indent=on&q=\*:*"`
What's happening here is that we are using Solr's query parameter (`q`) with a special syntax that requests all documents in the index (`\*:*`). Not all of the documents are returned to us, however, because of the default for a parameter called `rows`, which you can see in the form is `10`. You can change the parameter in the UI or in the defaults if you wish.
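For example, if you want more documents back than the default, you can pass a different `rows` value on the URL (the value 20 here is just an illustration):
[source,bash]
curl "http://localhost:8983/solr/techproducts/select?q=*:*&rows=20"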
Solr has very powerful search options, and this tutorial won't be able to cover all of them. But we can cover some of the most common types of queries.
==== Search for a Single Term
To search for a term, enter it as the `q` param value in the Solr Admin UI Query screen, replacing `\*:*` with the term you want to find.
Enter "foundation" and hit btn:[Execute Query] again.
If you prefer curl, enter something like this:
`curl "http://localhost:8983/solr/techproducts/select?q=foundation"`
You'll see something like this:
[source,json]
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":8,
"params":{
"q":"foundation"}},
"response":{"numFound":4,"start":0,"maxScore":2.7879646,"docs":[
{
"id":"0553293354",
"cat":["book"],
"name":"Foundation",
"price":7.99,
"price_c":"7.99,USD",
"inStock":true,
"author":"Isaac Asimov",
"author_s":"Isaac Asimov",
"series_t":"Foundation Novels",
"sequence_i":1,
"genre_s":"scifi",
"_version_":1574100232473411586,
"price_c____l_ns":799}]
}}
The response indicates that there are 4 hits (`"numFound":4`). We've only included one document in the above sample output, but since 4 hits is fewer than the `rows` parameter default of 10, you should see all 4 of them in your own results.
Note the `responseHeader` before the documents. This header includes information about the request. By default it echoes only the parameters _you_ have set for this query, which in this case is only your query term.
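If you are curious about the defaults that were applied as well, you can ask Solr to echo every parameter it used by adding `echoParams=all` to the request, for example:
[source,bash]
curl "http://localhost:8983/solr/techproducts/select?q=foundation&echoParams=all"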
The documents we got back include all the fields for each document that were indexed. This is, again, default behavior. If you want to restrict the fields in the response, you can use the `fl` param, which takes a comma-separated list of field names. This is one of the available fields on the query form in the Admin UI.
Put "id" (without quotes) in the "fl" box and hit btn:[Execute Query] again. Or, to specify it with curl:
`curl "http://localhost:8983/solr/techproducts/select?q=foundation&fl=id"`
You should only see the IDs of the matching records returned.
==== Field Searches
All Solr queries look for documents using some field. Often you want to query across multiple fields at the same time, and this is what we've done so far with the "foundation" query. This is possible with the use of copy fields, which are set up already with this set of configurations. We'll cover copy fields a little bit more in Exercise 2.
Sometimes, though, you want to limit your query to a single field. This can make your queries more efficient and the results more relevant for users.
Much of the data in our small sample data set is related to products. Let's say we want to find all the "electronics" products in the index. In the Query screen, enter "electronics" (without quotes) in the `q` box and hit btn:[Execute Query]. You should get 14 results, such as:
[source,json]
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":6,
"params":{
"q":"electronics"}},
"response":{"numFound":14,"start":0,"maxScore":1.5579545,"docs":[
{
"id":"IW-02",
"name":"iPod & iPod Mini USB 2.0 Cable",
"manu":"Belkin",
"manu_id_s":"belkin",
"cat":["electronics",
"connector"],
"features":["car power adapter for iPod, white"],
"weight":2.0,
"price":11.5,
"price_c":"11.50,USD",
"popularity":1,
"inStock":false,
"store":"37.7752,-122.4232",
"manufacturedate_dt":"2006-02-14T23:55:59Z",
"_version_":1574100232554151936,
"price_c____l_ns":1150}]
}}
This search finds all documents that contain the term "electronics" anywhere in the indexed fields. However, we can see from the above there is a `cat` field (for "category"). If we limit our search for only documents with the category "electronics", the results will be more precise for our users.
Update your query in the `q` field of the Admin UI so it's `cat:electronics`. Now you get 12 results:
[source,json]
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":6,
"params":{
"q":"cat:electronics"}},
"response":{"numFound":12,"start":0,"maxScore":0.9614112,"docs":[
{
"id":"SP2514N",
"name":"Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133",
"manu":"Samsung Electronics Co. Ltd.",
"manu_id_s":"samsung",
"cat":["electronics",
"hard drive"],
"features":["7200RPM, 8MB cache, IDE Ultra ATA-133",
"NoiseGuard, SilentSeek technology, Fluid Dynamic Bearing (FDB) motor"],
"price":92.0,
"price_c":"92.0,USD",
"popularity":6,
"inStock":true,
"manufacturedate_dt":"2006-02-13T15:26:37Z",
"store":"35.0752,-97.032",
"_version_":1574100232511160320,
"price_c____l_ns":9200}]
}}
Using curl, this query would look like this:
`curl "http://localhost:8983/solr/techproducts/select?q=cat:electronics"`
==== Phrase Search
To search for a multi-term phrase, enclose it in double quotes: `q="multiple terms here"`. For example, search for "CAS latency" by entering that phrase in quotes to the `q` box in the Admin UI.
If you're following along with curl, note that the space between terms must be converted to "+" in a URL, like so:
`curl "http://localhost:8983/solr/techproducts/select?q=\"CAS+latency\""`
We get 2 results:
[source,json]
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":7,
"params":{
"q":"\"CAS latency\""}},
"response":{"numFound":2,"start":0,"maxScore":5.937691,"docs":[
{
"id":"VDBDB1A16",
"name":"A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - OEM",
"manu":"A-DATA Technology Inc.",
"manu_id_s":"corsair",
"cat":["electronics",
"memory"],
"features":["CAS latency 3, 2.7v"],
"popularity":0,
"inStock":true,
"store":"45.18414,-93.88141",
"manufacturedate_dt":"2006-02-13T15:26:37Z",
"payloads":"electronics|0.9 memory|0.1",
"_version_":1574100232590852096},
{
"id":"TWINX2048-3200PRO",
"name":"CORSAIR XMS 2GB (2 x 1GB) 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) Dual Channel Kit System Memory - Retail",
"manu":"Corsair Microsystems Inc.",
"manu_id_s":"corsair",
"cat":["electronics",
"memory"],
"features":["CAS latency 2, 2-3-3-6 timing, 2.75v, unbuffered, heat-spreader"],
"price":185.0,
"price_c":"185.00,USD",
"popularity":5,
"inStock":true,
"store":"37.7752,-122.4232",
"manufacturedate_dt":"2006-02-13T15:26:37Z",
"payloads":"electronics|6.0 memory|3.0",
"_version_":1574100232584560640,
"price_c____l_ns":18500}]
}}
==== Combining Searches
By default, when you search for multiple terms and/or phrases in a single query, Solr will only require that one of them is present in order for a document to match. Documents containing more terms will be sorted higher in the results list.
You can require that a term or phrase is present by prefixing it with a `+`; conversely, to disallow the presence of a term or phrase, prefix it with a `-`.
To find documents that contain both terms "electronics" and "music", enter `+electronics +music` in the `q` box in the Admin UI Query tab.
If you're using curl, you must encode the `+` character because it has a reserved purpose in URLs (encoding the space character). The encoding for `+` is `%2B`:
`curl "http://localhost:8983/solr/techproducts/select?q=%2Belectronics%20%2Bmusic"`
You should only get a single result.
To search for documents that contain the term "electronics" but *don't* contain the term "music", enter `+electronics -music` in the `q` box in the Admin UI. For curl, again, URL encode "+" as "%2B":
`curl "http://localhost:8983/solr/techproducts/select?q=%2Belectronics+-music"`
This time you get 13 results.
==== More Information on Searching
We have only scratched the surface of the search options available in Solr. For more Solr search options, see the section on <<searching.adoc#searching,Searching>>.
=== Exercise 1 Wrap Up
At this point, you've seen how Solr can index data and have done some basic queries. You can choose now to continue to the next example which will introduce more Solr concepts, such as faceting results and managing your schema, or you can strike out on your own.
If you decide not to continue with this tutorial, the data we've indexed so far is likely of little value to you. You can delete your installation and start over, or you can use the `bin/solr` script we started out with to delete this collection:
`bin/solr delete -c techproducts`
And then create a new collection:
`bin/solr create -c <yourCollection> -s 2 -rf 2`
To stop both of the Solr nodes we started, issue the command:
`bin/solr stop -all`
For more information on start/stop and collection options with `bin/solr`, see <<solr-control-script-reference.adoc#solr-control-script-reference,Solr Control Script Reference>>.
[[exercise-2]]
== Exercise 2: Modify the Schema and Index Films Data
This exercise will build on the last one and introduce you to the index schema and Solr's powerful faceting features.
=== Restart Solr
Did you stop Solr after the last exercise? No? Then go ahead to the next section.
If you did, though, and need to restart Solr, issue these commands:
`./bin/solr start -c -p 8983 -s example/cloud/node1/solr`
This starts the first node. When it's done, start the second node and tell it how to connect to ZooKeeper:
`./bin/solr start -c -p 7574 -s example/cloud/node2/solr -z localhost:9983`
=== Create a New Collection
We're going to use a whole new data set in this exercise, so it would be better to have a new collection instead of trying to reuse the one we had before.
One reason for this is we're going to use a feature in Solr called "field guessing", where Solr attempts to guess what type of data is in a field while it's indexing it. It also automatically creates new fields in the schema for new fields that appear in incoming documents. This mode is called "Schemaless". We'll see the benefits and limitations of this approach to help you decide how and where to use it in your real application.
.What is a "schema" and why do I need one?
[sidebar]
****
Solr's schema is a single file (in XML) that stores the details about the fields and field types Solr is expected to understand. The schema defines not only the field or field type names, but also any modifications that should happen to a field before it is indexed. For example, if you want to ensure that a user who enters "abc" and a user who enters "ABC" can both find a document containing the term "ABC", you will want to normalize (lower-case it, in this case) "ABC" when it is indexed, and normalize the user query to be sure of a match. These rules are defined in your schema.
Earlier in the tutorial we mentioned copy fields, which are fields made up of data that originated from other fields. You can also define dynamic fields, which use wildcards (such as `*_t` or `*_s`) to dynamically create fields of a specific field type. These types of rules are also defined in the schema.
****
When we initially started Solr in the first exercise, we had a choice of a configSet to use. The one we chose had a schema that was pre-defined for the data we later indexed. This time, we're going to use a configSet that has a very minimal schema and let Solr figure out from the data what fields to add.
The data you're going to index is related to movies, so start by creating a collection named "films" that uses the `_default` configSet:
`bin/solr create -c films -s 2 -rf 2`
Whoa, wait. We didn't specify a configSet! That's fine, the `_default` is appropriately named, since it's the default and is used if you don't specify one at all.
We did, however, set two parameters `-s` and `-rf`. Those are the number of shards to split the collection across (2) and how many replicas to create (2). This is equivalent to the options we had during the interactive example from the first exercise.
You should see output like:
[source,subs="verbatim,attributes+"]
----
WARNING: Using _default configset. Data driven schema functionality is enabled by default, which is
NOT RECOMMENDED for production use.
To turn it off:
curl http://localhost:7574/solr/films/config -d '{"set-user-property": {"update.autoCreateFields":"false"}}'
Connecting to ZooKeeper at localhost:9983 ...
INFO - 2017-07-27 15:07:46.191; org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider; Cluster at localhost:9983 ready
Uploading /{solr-docs-version}.0/server/solr/configsets/_default/conf for config films to ZooKeeper at localhost:9983
Creating new collection 'films' using command:
http://localhost:7574/solr/admin/collections?action=CREATE&name=films&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=films
{
"responseHeader":{
"status":0,
"QTime":3830},
"success":{
"192.168.0.110:8983_solr":{
"responseHeader":{
"status":0,
"QTime":2076},
"core":"films_shard2_replica_n1"},
"192.168.0.110:7574_solr":{
"responseHeader":{
"status":0,
"QTime":2494},
"core":"films_shard1_replica_n2"}}}
----
The first thing the command printed was a warning about not using this configSet in production. That's due to some of the limitations we'll cover shortly.
Otherwise, though, the collection should be created. If we go to the Admin UI at http://localhost:8983/solr/#/films/collection-overview we should see the overview screen.
==== Preparing Schemaless for the Films Data
There are two parallel things happening with the schema that comes with the `_default` configSet.
First, we are using a "managed schema", which is configured to only be modified by Solr's Schema API. That means we should not hand-edit it so there isn't confusion about which edits come from which source. Solr's Schema API allows us to make changes to fields, field types, and other types of schema rules.
Second, we are using "field guessing", which is configured in the `solrconfig.xml` file (and includes most of Solr's various configuration settings). Field guessing is designed to allow us to start using Solr without having to define all the fields we think will be in our documents before trying to index them. This is why we call it "schemaless", because you can start quickly and let Solr create fields for you as it encounters them in documents.
Sounds great! Well, not really, there are limitations. It's a bit brute force, and if it guesses wrong, you can't change much about a field after data has been indexed without having to reindex. If we only have a few thousand documents that might not be bad, but if you have millions and millions of documents, or, worse, don't have access to the original data anymore, this can be a real problem.
For these reasons, the Solr community does not recommend going to production without a schema that you have defined yourself. By this we mean that the schemaless features are fine to start with, but you should still always make sure your schema matches your expectations for how you want your data indexed and how users are going to query it.
It is possible to mix schemaless features with a defined schema. Using the Schema API, you can define a few fields that you know you want to control, and let Solr guess others that are less important or which you are confident (through testing) will be guessed to your satisfaction. That's what we're going to do here.
===== Create the "names" Field
The films data we are going to index has a small number of fields for each movie: an ID, director name(s), film name, release date, and genre(s).
If you look at one of the files in `example/films`, you'll see the first film is named _.45_, released in 2006. Because this is the first document in the dataset, Solr will guess the field type based on the data in this record. If we go ahead and index this data, that first film name is going to indicate to Solr that the field is a "float" numeric field, and Solr will create a "name" field with the type `FloatPointField`. All data after this record will be expected to be a float.
Well, that's not going to work. We have titles like _A Mighty Wind_ and _Chicken Run_, which are strings - decidedly not numeric and not floats. If we let Solr guess the "name" field is a float, what will happen is later titles will cause an error and indexing will fail. That's not going to get us very far.
What we can do is set up the "name" field in Solr before we index the data to be sure Solr always interprets it as a string. At the command line, enter this curl command:
[source,bash]
curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' http://localhost:8983/solr/films/schema
This command uses the Schema API to explicitly define a field named "name" that has the field type "text_general" (a text field). It will not be permitted to have multiple values, but it will be stored (meaning it can be retrieved by queries).
You can also use the Admin UI to create fields, but it offers a bit less control over the properties of your field. It will work for our case, though:
.Creating a field
image::images/solr-tutorial/tutorial-add-field.png[]
===== Create a "catchall" Copy Field
There's one more change to make before we start indexing.
In the first exercise when we queried the documents we had indexed, we didn't have to specify a field to search because the configuration we used was set up to copy fields into a `text` field, and that field was the default when no other field was defined in the query.
The configuration we're using now doesn't have that rule. We would need to define a field to search for every query. We can, however, set up a "catchall field" by defining a copy field that will take all data from all fields and index it into a field named `\_text_`. Let's do that now.
You can use either the Admin UI or the Schema API for this.
At the command line, use the Schema API again to define a copy field:
[source,bash]
curl -X POST -H 'Content-type:application/json' --data-binary '{"add-copy-field" : {"source":"*","dest":"_text_"}}' http://localhost:8983/solr/films/schema
In the Admin UI, choose btn:[Add Copy Field], then fill out the source and destination for your field, as in this screenshot.
.Creating a copy field
image::images/solr-tutorial/tutorial-add-copy-field.png[]
What this does is make a copy of all fields and put the data into the "\_text_" field.
TIP: It can be very expensive to do this with your production data because it tells Solr to effectively index everything twice. It will make indexing slower, and make your index larger. With your production data, you will want to be sure you only copy fields that really warrant it for your application.
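For example, if you later decide the catchall rule is too broad, the same Schema API call accepts a specific source field. A hypothetical rule that copies only the `name` field would look like the command below (shown purely as an illustration; for this tutorial we will keep the `*` rule we just created):
[source,bash]
curl -X POST -H 'Content-type:application/json' --data-binary '{"add-copy-field": {"source":"name", "dest":"_text_"}}' http://localhost:8983/solr/films/schema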
OK, now we're ready to index the data and start playing around with it.
=== Index Sample Film Data
The films data we will index is located in the `example/films` directory of your installation. It comes in three formats: JSON, XML and CSV. Pick one of the formats and index it into the "films" collection (in each example, one command is for Unix/MacOS and the other is for Windows):
.To Index JSON Format
[source,subs="verbatim,attributes+"]
----
bin/post -c films example/films/films.json
C:\solr-{solr-docs-version}.0> java -jar -Dc=films -Dauto example\exampledocs\post.jar example\films\*.json
----
.To Index XML Format
[source,subs="verbatim,attributes+"]
----
bin/post -c films example/films/films.xml
C:\solr-{solr-docs-version}.0> java -jar -Dc=films -Dauto example\exampledocs\post.jar example\films\*.xml
----
.To Index CSV Format
[source,subs="verbatim,attributes+"]
----
bin/post -c films example/films/films.csv -params "f.genre.split=true&f.directed_by.split=true&f.genre.separator=|&f.directed_by.separator=|"
C:\solr-{solr-docs-version}.0> java -jar -Dc=films -Dparams="f.genre.split=true&f.directed_by.split=true&f.genre.separator=|&f.directed_by.separator=|" -Dauto example\exampledocs\post.jar example\films\*.csv
----
Each command includes these main parameters:
* `-c films`: this is the Solr collection to index data to.
* `example/films/films.json` (or `films.xml` or `films.csv`): this is the path to the data file to index. You could simply supply the directory where this file resides, but since you know the format you want to index, specifying the exact file for that format is more efficient.
Note the CSV command includes extra parameters. This is to ensure multi-valued entries in the "genre" and "directed_by" columns are split by the pipe (`|`) character, used in this file as a separator. Telling Solr to split these columns this way will ensure proper indexing of the data.
Each command will produce output similar to the below seen while indexing JSON:
[source,bash,subs="verbatim,attributes"]
----
$ ./bin/post -c films example/films/films.json
/bin/java -classpath /solr-{solr-docs-version}.0/dist/solr-core-{solr-docs-version}.0.jar -Dauto=yes -Dc=films -Ddata=files org.apache.solr.util.SimplePostTool example/films/films.json
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/films/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file films.json (application/json) to [base]/json/docs
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/films/update...
Time spent: 0:00:00.878
----
Hooray!
If you go to the Query screen in the Admin UI for films (http://localhost:8983/solr/#/films/query) and hit btn:[Execute Query] you should see 1100 results, with the first 10 returned to the screen.
Let's do a query to see if the "catchall" field worked properly. Enter "comedy" in the `q` box and hit btn:[Execute Query] again. You should get 417 results. Feel free to play around with other searches before we move on to faceting.
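If you'd rather stay on the command line, an equivalent request with curl is:
[source,bash]
curl "http://localhost:8983/solr/films/select?q=comedy"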
[[tutorial-faceting]]
=== Faceting
One of Solr's most popular features is faceting. Faceting allows the search results to be arranged into subsets (or buckets, or categories), providing a count for each subset. There are several types of faceting: field values, numeric and date ranges, pivots (decision tree), and arbitrary query faceting.
==== Field Facets
In addition to providing search results, a Solr query can return the number of documents that contain each unique value in the whole result set.
On the Admin UI Query tab, if you check the `facet` checkbox, you'll see a few facet-related options appear:
.Facet options in the Query screen
image::images/solr-tutorial/tutorial-admin-ui-facet-options.png[Solr Quick Start: Query tab facet options]
To see facet counts from all documents (`q=\*:*`): turn on faceting (`facet=true`), and specify the field to facet on via the `facet.field` param. If you only want facets, and no document contents, specify `rows=0`. The `curl` command below will return facet counts for the `genre_str` field:
`curl "http://localhost:8983/solr/films/select?q=\*:*&rows=0&facet=true&facet.field=genre_str"`
In your terminal, you'll see something like:
[source,json]
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":11,
"params":{
"q":"*:*",
"facet.field":"genre_str",
"rows":"0",
"facet":"true"}},
"response":{"numFound":1100,"start":0,"maxScore":1.0,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{
"genre_str":[
"Drama",552,
"Comedy",389,
"Romance Film",270,
"Thriller",259,
"Action Film",196,
"Crime Fiction",170,
"World cinema",167]},
"facet_ranges":{},
"facet_intervals":{},
"facet_heatmaps":{}}}
We've truncated the output here a little bit, but in the `facet_counts` section, you see by default you get a count of the number of documents using each genre for every genre in the index. Solr has a parameter `facet.mincount` that you could use to limit the facets to only those that contain a certain number of documents (this parameter is not shown in the UI). Or, perhaps you do want all the facets, and you'll let your application's front-end control how it's displayed to users.
If you wanted to control the number of items in a bucket, you could do something like this:
`curl "http://localhost:8983/solr/films/select?=&q=\*:*&facet.field=genre_str&facet.mincount=200&facet=on&rows=0"`
You should only see 4 facets returned.
There are a great deal of other parameters available to help you control how Solr constructs the facets and facet lists. We'll cover some of them in this exercise, but you can also see the section <<faceting.adoc#faceting,Faceting>> for more detail.
==== Range Facets
For numerics or dates, it's often desirable to partition the facet counts into ranges rather than discrete values. A prime example of numeric range faceting, using the example techproducts data from our previous exercise, is `price`. In the `/browse` UI, it looks like this:
.Range facets
image::images/solr-tutorial/tutorial-range-facet.png[Solr Quick Start: Range facets]
The films data includes the release date for films, and we could use that to create date range facets, which are another common use for range facets.
The Solr Admin UI doesn't yet support range facet options, so you will need to use curl or a similar command line tool for the following examples.
If we construct a query that looks like this:
[source,bash]
curl 'http://localhost:8983/solr/films/select?q=*:*&rows=0'\
'&facet=true'\
'&facet.range=initial_release_date'\
'&facet.range.start=NOW-20YEAR'\
'&facet.range.end=NOW'\
'&facet.range.gap=%2B1YEAR'
This will request all films and ask for them to be grouped by year starting with 20 years ago (our earliest release date is in 2000) and ending today. Note that this query again URL encodes a `+` as `%2B`.
In the terminal you will see:
[source,json]
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":8,
"params":{
"facet.range":"initial_release_date",
"facet.limit":"300",
"q":"*:*",
"facet.range.gap":"+1YEAR",
"rows":"0",
"facet":"on",
"facet.range.start":"NOW-20YEAR",
"facet.range.end":"NOW"}},
"response":{"numFound":1100,"start":0,"maxScore":1.0,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{},
"facet_ranges":{
"initial_release_date":{
"counts":[
"1997-07-28T17:12:06.919Z",0,
"1998-07-28T17:12:06.919Z",0,
"1999-07-28T17:12:06.919Z",48,
"2000-07-28T17:12:06.919Z",82,
"2001-07-28T17:12:06.919Z",103,
"2002-07-28T17:12:06.919Z",131,
"2003-07-28T17:12:06.919Z",137,
"2004-07-28T17:12:06.919Z",163,
"2005-07-28T17:12:06.919Z",189,
"2006-07-28T17:12:06.919Z",92,
"2007-07-28T17:12:06.919Z",26,
"2008-07-28T17:12:06.919Z",7,
"2009-07-28T17:12:06.919Z",3,
"2010-07-28T17:12:06.919Z",0,
"2011-07-28T17:12:06.919Z",0,
"2012-07-28T17:12:06.919Z",1,
"2013-07-28T17:12:06.919Z",1,
"2014-07-28T17:12:06.919Z",1,
"2015-07-28T17:12:06.919Z",0,
"2016-07-28T17:12:06.919Z",0],
"gap":"+1YEAR",
"start":"1997-07-28T17:12:06.919Z",
"end":"2017-07-28T17:12:06.919Z"}},
"facet_intervals":{},
"facet_heatmaps":{}}}
==== Pivot Facets
Another faceting type is pivot facets, also known as "decision trees", allowing two or more fields to be nested for all the various possible combinations. Using the films data, pivot facets can be used to see how many of the films in the "Drama" category (the `genre_str` field) were directed by each director. Here's how to get at the raw data for this scenario:
`curl "http://localhost:8983/solr/films/select?q=\*:*&rows=0&facet=on&facet.pivot=genre_str,directed_by_str"`
This results in the following response, which shows a facet for each category and director combination:
[source,json]
{"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":1147,
"params":{
"q":"*:*",
"facet.pivot":"genre_str,directed_by_str",
"rows":"0",
"facet":"on"}},
"response":{"numFound":1100,"start":0,"maxScore":1.0,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{},
"facet_ranges":{},
"facet_intervals":{},
"facet_heatmaps":{},
"facet_pivot":{
"genre_str,directed_by_str":[{
"field":"genre_str",
"value":"Drama",
"count":552,
"pivot":[{
"field":"directed_by_str",
"value":"Ridley Scott",
"count":5},
{
"field":"directed_by_str",
"value":"Steven Soderbergh",
"count":5},
{
"field":"directed_by_str",
"value":"Michael Winterbottom",
"count":4}}]}]}}}
We've truncated this output as well; you will see many more genres and directors on your screen.
=== Exercise 2 Wrap Up
In this exercise, we learned a little bit more about how Solr organizes data in the indexes, and how to work with the Schema API to manipulate the schema file. We also learned a bit about facets in Solr, including range facets and pivot facets. In both of these things, we've only scratched the surface of the available options. If you can dream it, it might be possible!
Like our previous exercise, this data may not be relevant to your needs. We can clean up our work by deleting the collection. To do that, issue this command at the command line:
`bin/solr delete -c films`
[[exercise-3]]
== Exercise 3: Index Your Own Data
For this last exercise, work with a dataset of your choice. This can be files on your local hard drive, a set of data you have worked with before, or maybe a sample of the data you intend to index to Solr for your production application.
This exercise is intended to get you thinking about what you will need to do for your application:
* What sorts of data do you need to index?
* What will you need to do to prepare Solr for your data (such as, create specific fields, set up copy fields, determine analysis rules, etc.)
* What kinds of search options do you want to provide to users?
* How much testing will you need to do to ensure everything works the way you expect?
=== Create Your Own Collection
Before you get started, create a new collection, named whatever you'd like. In this example, the collection will be named "localDocs"; replace that name with whatever name you choose if you want to.
`./bin/solr create -c localDocs -s 2 -rf 2`
Again, as we saw from Exercise 2 above, this will use the `_default` configSet and all the schemaless features it provides. As we noted previously, this may cause problems when we index our data. You may need to iterate on indexing a few times before you get the schema right.
=== Indexing Ideas
Solr has lots of ways to index data. Choose one of the approaches below and try it out with your system:
Local Files with bin/post::
If you have a local directory of files, the Post Tool (`bin/post`) can index a directory of files. We saw this in action in our first exercise.
+
We used only JSON, XML and CSV in our exercises, but the Post Tool can also handle HTML, PDF, Microsoft Office formats (such as MS Word), plain text, and more.
+
In this example, assume there is a directory named "Documents" locally. To index it, we would issue a command like this (correcting the collection name after the `-c` parameter as needed):
+
`./bin/post -c localDocs ~/Documents`
+
You may get errors as it works through your documents. These might be caused by the field guessing, or the file type may not be supported. Indexing content such as this demonstrates the need to plan Solr for your data, which requires understanding it and perhaps also some trial and error.
DataImportHandler::
Solr includes a tool called the <<uploading-structured-data-store-data-with-the-data-import-handler.adoc#uploading-structured-data-store-data-with-the-data-import-handler,Data Import Handler (DIH)>> which can connect to databases (if you have a jdbc driver), mail servers, or other structured data sources. There are several examples included for feeds, GMail, and a small HSQL database.
+
The `README.txt` file in `example/example-DIH` will give you details on how to start working with this tool.
SolrJ::
SolrJ is a Java-based client for interacting with Solr. Use <<using-solrj.adoc#using-solrj,SolrJ>> for JVM-based languages or other <<client-apis.adoc#client-apis,Solr clients>> to programmatically create documents to send to Solr.
Documents Screen::
Use the Admin UI <<documents-screen.adoc#documents-screen,Documents tab>> (at http://localhost:8983/solr/#/localDocs/documents) to paste in a document to be indexed, or select `Document Builder` from the `Document Type` dropdown to build a document one field at a time. Click on the btn:[Submit Document] button below the form to index your document.
=== Updating Data
You may notice that even if you index content in this tutorial more than once, it does not duplicate the results found. This is because the example Solr schema (a file named either `managed-schema` or `schema.xml`) specifies a `uniqueKey` field called `id`. Whenever you POST commands to Solr to add a document with the same value for the `uniqueKey` as an existing document, it automatically replaces it for you.
You can see that that has happened by looking at the values for `numDocs` and `maxDoc` in the core-specific Overview section of the Solr Admin UI.
`numDocs` represents the number of searchable documents in the index (and will be larger than the number of XML, JSON, or CSV files since some files contained more than one document). The `maxDoc` value may be larger as the `maxDoc` count includes logically deleted documents that have not yet been physically removed from the index. You can re-post the sample files over and over again as much as you want and `numDocs` will never increase, because the new documents will constantly be replacing the old.
Go ahead and edit any of the existing example data files, change some of the data, and re-run the PostTool (`bin/post`) again. You'll see your changes reflected in subsequent searches.
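As a quick check (assuming the `techproducts` collection from Exercise 1 still exists), re-posting one of the sample files is harmless; the documents it contains simply replace their earlier versions and `numDocs` stays the same:
[source,bash]
bin/post -c techproducts example/exampledocs/money.xml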
=== Deleting Data
If you need to iterate a few times to get your schema right, you may want to delete documents to clear out the collection and try again. Note, however, that merely removing documents doesn't change the underlying field definitions. Essentially, this will allow you to reindex your data after making changes to fields for your needs.
You can delete data by POSTing a delete command to the update URL and specifying the value of the document's unique key field, or a query that matches multiple documents (be careful with that one!). We can use `bin/post` to delete documents also if we structure the request properly.
Execute the following command to delete a specific document:
`bin/post -c localDocs -d "<delete><id>SP2514N</id></delete>"`
To delete all documents, you can use "delete-by-query" command like:
`bin/post -c localDocs -d "<delete><query>*:*</query></delete>"`
You can also modify the above to only delete documents that match a specific query.
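For example, a hypothetical delete-by-query that removes only documents whose `name` field contains the term "test" (substitute a field and term that make sense for your own data) might look like:
[source,bash]
bin/post -c localDocs -d "<delete><query>name:test</query></delete>"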
=== Exercise 3 Wrap Up
At this point, you're ready to start working on your own.
Jump ahead to the overall <<Wrapping Up,wrap up>> when you're ready to stop Solr and remove all the examples you worked with and start fresh.
== Spatial Queries
Solr has sophisticated geospatial support, including searching within a specified distance range of a given location (or within a bounding box), sorting by distance, or even boosting results by the distance.
Some of the example techproducts documents we indexed in Exercise 1 have locations associated with them to illustrate the spatial capabilities. To re-index this data, see <<index-the-techproducts-data,Exercise 1>>.
Spatial queries can be combined with any other types of queries, such as in this example of querying for "ipod" within 10 kilometers from San Francisco:
.Spatial queries and results
image::images/solr-tutorial/tutorial-spatial.png[Solr Quick Start: spatial search]
This is from Solr's example search UI (called `/browse`), which has a nice feature to show a map for each item and allow easy selection of the location to search near. You can see this yourself by going to <http://localhost:8983/solr/techproducts/browse?q=ipod&pt=37.7752%2C-122.4232&d=10&sfield=store&fq=%7B%21bbox%7D&queryOpts=spatial&queryOpts=spatial> in a browser.
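If you'd like to try a similar query without the `/browse` UI, a roughly equivalent request against the `/select` handler uses the same spatial parameters you can see in that URL: `pt` for the center point, `sfield` for the location field, `d` for the distance in kilometers, and a `bbox` filter query. The point below is the San Francisco location stored in the techproducts sample data:
[source,bash]
curl "http://localhost:8983/solr/techproducts/select?q=ipod&fq=%7B%21bbox%7D&pt=37.7752%2C-122.4232&sfield=store&d=10"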
To learn more about Solr's spatial capabilities, see the section <<spatial-search.adoc#spatial-search,Spatial Search>>.
== Wrapping Up
If you've run the full set of commands in this quick start guide you have done the following:
* Launched Solr into SolrCloud mode, two nodes, two collections including shards and replicas
* Indexed several types of files
* Used the Schema API to modify your schema
* Opened the admin console, used its query interface to get results
* Opened the /browse interface to explore Solr's features in a more friendly and familiar interface
Nice work!
== Cleanup
As you work through this tutorial, you may want to stop Solr and reset the environment back to the starting point. The following command line will stop Solr and remove the directories for each of the two nodes that were created all the way back in Exercise 1:
`bin/solr stop -all ; rm -Rf example/cloud/`
== Where to next?
This Guide will be your best resource for learning more about Solr.
Solr also has a robust community made up of people happy to help you get started. For more information, check out the Solr website's http://lucene.apache.org/solr/resources.html[Resources page].
View File
@ -1,6 +1,6 @@
= Upgrading Solr
:page-shortname: upgrading-solr
:page-permalink: upgrading-solr.html
= Solr Upgrade Notes
:page-shortname: solr-upgrade-notes
:page-permalink: solr-upgrade-notes.html
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
@ -18,10 +18,16 @@
// specific language governing permissions and limitations
// under the License.
If you are already using Solr 6.5, Solr 6.6 should not present any major problems. However, you should review the {solr-javadocs}/changes/Changes.html[`CHANGES.txt`] file found in your Solr package for changes and updates that may affect your existing implementation. Detailed steps for upgrading a Solr cluster can be found in the appendix: <<upgrading-a-solr-cluster.adoc#upgrading-a-solr-cluster,Upgrading a Solr Cluster>>.
The following notes describe changes to Solr in recent releases that you should be aware of before upgrading.
These notes are meant to highlight the biggest changes that may impact the largest number of implementations. It is not a comprehensive list of all changes to Solr in any release.
When planning your Solr upgrade, consider the customizations you have made to your system and review the {solr-javadocs}/changes/Changes.html[`CHANGES.txt`] file found in your Solr package. That file includes all of the changes and updates that may affect your existing implementation. Detailed steps for upgrading a Solr cluster can be found in the appendix: <<upgrading-a-solr-cluster.adoc#upgrading-a-solr-cluster,Upgrading a Solr Cluster>>.
== Upgrading from 6.5.x
If you are already using Solr 6.5, Solr 6.6 should not present any major problems.
* Solr contribs map-reduce, morphlines-core and morphlines-cell have been removed.
* JSON Facet API now uses hyper-log-log for numBuckets cardinality calculation and calculates cardinality before filtering buckets by any mincount greater than 1.
View File
@ -0,0 +1,690 @@
= Statistical Programming
:page-shortname: statistical-programming
:page-permalink: statistical-programming.html
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
The Streaming Expression language includes a powerful statistical programming syntax with many of the
features of a functional programming language. The syntax includes *variables*, *data structures*
and a growing set of *mathematical functions*.
Using the statistical programming syntax, Solr's powerful *data retrieval*
capabilities can be combined with in-depth *statistical analysis*.
The *data retrieval* methods include:
* SQL
* time series aggregation
* random sampling
* faceted aggregation
* KNN searches
* topic message queues
* MapReduce (parallel relational algebra)
* JDBC calls to outside databases
* Graph Expressions
Once the data is retrieved, the statistical programming syntax can be used to create *arrays* from the data so it
can be *manipulated*, *transformed* and *analyzed*.
The statistical function library includes functions that perform:
* Correlation
* Cross-correlation
* Covariance
* Moving averages
* Percentiles
* Simple regression and prediction
* ANOVA
* Histograms
* Convolution
* Euclidean distance
* Descriptive statistics
* Rank transformation
* Normalization transformation
* Sequences
* Array manipulation functions (creation, copying, length, scaling, reverse etc...)
The statistical function library is backed by *Apache Commons Math*.
This document provides an overview of how to apply the variables, data structures
and mathematical functions.
== /stream handler
Like all Streaming Expressions, the statistical functions can be run by Solr's /stream handler.
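Any of the expressions in this document can be sent to that handler with a plain HTTP client. For example, with curl, and assuming a collection named `collection2` (the collection name used in the examples below), the expression is passed as the `expr` parameter:
[source,bash]
curl --data-urlencode 'expr=add(1, 1)' http://localhost:8983/solr/collection2/stream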
== Math
Streaming Expressions contain a suite of *mathematical functions* which can be called on
their own or as part of a larger expression.
Solr's /stream handler evaluates the mathematical expression and returns a result.
For example sending the following expression to the /stream handler:
[source,text]
----
add(1, 1)
----
Returns the following response:
[source,text]
----
{
"result-set": {
"docs": [
{
"return-value": 2
},
{
"EOF": true,
"RESPONSE_TIME": 2
}
]
}
}
----
You can nest math functions within each other. For example:
[source,text]
----
pow(10, add(1,1))
----
Returns the following response:
[source,text]
----
{
"result-set": {
"docs": [
{
"return-value": 100
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
You can also perform math on a stream of Tuples.
For example:
[source,text]
----
select(search(collection2, q="*:*", fl="price_f", sort="price_f desc", rows="3"),
price_f,
mult(price_f, 10) as newPrice)
----
Returns the following response:
[source, text]
----
{
"result-set": {
"docs": [
{
"price_f": 0.99999994,
"newPrice": 9.9999994
},
{
"price_f": 0.99999994,
"newPrice": 9.9999994
},
{
"price_f": 0.9999992,
"newPrice": 9.999992
},
{
"EOF": true,
"RESPONSE_TIME": 3
}
]
}
}
----
== Array (data structure)
The first data structure we'll explore is the *array*.
We can create an array with the `array` function:
For example:
[source,text]
----
array(1, 2, 3)
----
Returns the following response:
[source,text]
----
{
"result-set": {
"docs": [
{
"return-value": [
1,
2,
3
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
We can nest arrays within arrays to form a *matrix*:
[source,text]
----
array(array(1, 2, 3),
array(4, 5, 6))
----
Returns the following response:
[source,text]
----
{
"result-set": {
"docs": [
{
"return-value": [
[
1,
2,
3
],
[
4,
5,
6
]
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
We can manipulate arrays with functions.
For example, we can reverse an array with the `rev` function:
[source,text]
----
rev(array(1, 2, 3))
----
Returns the following response:
[source,text]
----
{
"result-set": {
"docs": [
{
"return-value": [
3,
2,
1
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
Arrays can also be built and returned by functions.
For example, the `sequence` function:
[source,text]
----
sequence(5,0,1)
----
This returns an array of size *5* starting from *0* with a stride of *1*.
[source,text]
----
{
"result-set": {
"docs": [
{
"return-value": [
0,
1,
2,
3,
4
]
},
{
"EOF": true,
"RESPONSE_TIME": 4
}
]
}
}
----
We can perform math on an array.
For example we can scale an array with the `scale` function:
Expression:
[source,text]
----
scale(10, sequence(5,0,1))
----
Returns the following response:
[source,text]
----
{
"result-set": {
"docs": [
{
"return-value": [
0,
10,
20,
30,
40
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
We can perform *statistical analysis* on arrays.
For example we can correlate two sequences with the `corr` function:
[source,text]
----
corr(sequence(5,1,1), sequence(5,10,10))
----
Returns the following response:
[source,text]
----
{
"result-set": {
"docs": [
{
"return-value": 1
},
{
"EOF": true,
"RESPONSE_TIME": 1
}
]
}
}
----
== Tuple (data structure)
The *tuple* is the next data structure we'll explore.
The `tuple` function returns a map of name/value pairs. A tuple is a very flexible data structure
that can hold values that are strings, numerics, arrays and lists of tuples.
A tuple can be used to return a complex result from a statistical expression.
Here is an example:
[source,text]
----
tuple(title="hello world",
array1=array(1,2,3,4),
array2=array(4,5,6,7))
----
Returns the following response:
[source,text]
----
{
"result-set": {
"docs": [
{
"title": "hello world",
"array1": [
1,
2,
3,
4
],
"array2": [
4,
5,
6,
7
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
== List (data structure)
Next we have the *list* data structure.
The `list` function is a data structure that wraps Streaming Expressions and emits all the tuples from the wrapped
expressions as a single concatenated stream.
Below is an example of a list of tuples:
[source,text]
----
list(tuple(id=1, data=array(1, 2, 3)),
tuple(id=2, data=array(10, 12, 14)))
----
Returns the following response:
[source,text]
----
{
"result-set": {
"docs": [
{
"id": "1",
"data": [
1,
2,
3
]
},
{
"id": "2",
"data": [
10,
12,
14
]
},
{
"EOF": true,
"RESPONSE_TIME": 0
}
]
}
}
----
== Let (setting variables)
The `let` function sets *variables* and runs a Streaming Expression that references the variables. The `let` function can be used to
write small statistical programs.
A *variable* can be set to the output of any Streaming Expression.
Here is a very simple example:
[source,text]
----
let(a=random(collection2, q="*:*", rows="3", fl="price_f"),
b=random(collection2, q="*:*", rows="3", fl="price_f"),
tuple(sample1=a, sample2=b))
----
The `let` expression above is setting variables *a* and *b* to random
samples taken from collection2.
The `let` function then executes the `tuple` streaming expression
which references the two variables.
Here is the output:
[source,text]
----
{
"result-set": {
"docs": [
{
"sample1": [
{
"price_f": 0.39729273
},
{
"price_f": 0.063344836
},
{
"price_f": 0.42020327
}
],
"sample2": [
{
"price_f": 0.659244
},
{
"price_f": 0.58797807
},
{
"price_f": 0.57520163
}
]
},
{
"EOF": true,
"RESPONSE_TIME": 20
}
]
}
}
----
== Creating arrays with `col` function
The `col` function is used to move a column of numbers from a list of tuples into an `array`.
This is an important function because Streaming Expressions such as `sql`, `random` and `timeseries` return tuples,
but the statistical functions operate on arrays.
Below is an example of the `col` function:
[source,text]
----
let(a=random(collection2, q="*:*", rows="3", fl="price_f"),
b=random(collection2, q="*:*", rows="3", fl="price_f"),
c=col(a, price_f),
d=col(b, price_f),
tuple(sample1=c, sample2=d))
----
The example above is using the `col` function to create arrays from the tuples stored in
variables *a* and *b*.
Variable *c* contains an array of values from the *price_f* field,
taken from the tuples stored in variable *a*.
Variable *d* contains an array of values from the *price_f* field,
taken from the tuples stored in variable *b*.
Also notice that the response `tuple` executed by `let` points to the arrays in variables *c* and *d*.
The response shows the arrays:
[source,text]
----
{
"result-set": {
"docs": [
{
"sample1": [
0.06490427,
0.6751543,
0.07063508
],
"sample2": [
0.8884564,
0.8878821,
0.3504665
]
},
{
"EOF": true,
"RESPONSE_TIME": 17
}
]
}
}
----
== Statistical Programming Example
We've covered how the *data structures*, *variables* and a few *statistical functions* work.
Let's dive into an example that puts these tools to use.
=== Use case
We have an existing hotel in *cityA* that is very profitable.
We are contemplating opening up a new hotel in a different city.
We're considering 4 different cities: *cityB*, *cityC*, *cityD*, *cityE*.
We'd like to open a hotel in a city that has similar room rates to *cityA*.
How do we determine which of the 4 cities we're considering has room rates which are most similar to *cityA*?
=== The Data
We have a data set of un-aggregated hotel *bookings*. Each booking record has a rate and city.
=== Can we simply aggregate?
One approach would be to aggregate the data from each city and compare the *mean* room rates. This approach will
give us some useful information, but the mean is a summary statistic which loses a significant amount of information
about the data. For example we don't have an understanding of how the distribution of room rates is impacting the
mean.
The *median* room rate provides another interesting data point but it's still not the entire picture. It's still just
one point of reference.
Is there a way that we can compare the markets without losing valuable information in the data?
=== K Nearest Neighbor
The use case we're reasoning about can often be approached using a K Nearest Neighbor (knn) algorithm.
With knn we use a *distance* measure to compare vectors of data to find the k nearest neighbors to
a specific vector.
=== Euclidean Distance
The Streaming Expression statistical function library has a function called `distance`. The `distance` function
computes the Euclidean distance between two vectors. This looks promising for comparing vectors of room rates.
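As a quick sanity check of `distance` on its own, you can send it two small arrays through the `/stream` handler (again assuming a collection named `collection2`); the two vectors below differ by 1 in each of their four positions, so the value returned is the Euclidean distance sqrt(1+1+1+1) = 2:
[source,bash]
curl --data-urlencode 'expr=distance(array(1,2,3,4), array(2,3,4,5))' http://localhost:8983/solr/collection2/stream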
=== Vectors
But how do we create the vectors from our data set? Remember, we have un-aggregated room rates from each of the cities.
How can we vectorize the data so it can be compared using the `distance` function?
We have a Streaming Expression that can retrieve a *random sample* from each of the cities. The name of this
expression is `random`. So we could take a random sample of 1000 room rates from each of the five cities.
But random vectors of room rates are not comparable because the distance algorithm compares values at each index
in the vector. How can we make these vectors comparable?
We can make them comparable by *sorting* them. Then as the distance algorithm moves along the vectors it will be
comparing room rates from lowest to highest in both cities.
=== The code
[source,text]
----
let(cityA=sort(random(bookings, q="city:cityA", rows="1000", fl="rate_d"), by="rate_d asc"),
cityB=sort(random(bookings, q="city:cityB", rows="1000", fl="rate_d"), by="rate_d asc"),
cityC=sort(random(bookings, q="city:cityC", rows="1000", fl="rate_d"), by="rate_d asc"),
cityD=sort(random(bookings, q="city:cityD", rows="1000", fl="rate_d"), by="rate_d asc"),
cityE=sort(random(bookings, q="city:cityE", rows="1000", fl="rate_d"), by="rate_d asc"),
ratesA=col(cityA, rate_d),
ratesB=col(cityB, rate_d),
ratesC=col(cityC, rate_d),
ratesD=col(cityD, rate_d),
ratesE=col(cityE, rate_d),
top(n=1,
sort="distance asc",
list(tuple(city=B, distance=distance(ratesA, ratesB)),
tuple(city=C, distance=distance(ratesA, ratesC)),
tuple(city=D, distance=distance(ratesA, ratesD)),
tuple(city=E, distance=distance(ratesA, ratesE)))))
----
==== The code explained
The `let` expression sets variables first.
The first 5 variables (cityA, cityB, cityC, cityD, cityE) contain the random samples from the `bookings` collection.
The `random` function is pulling 1000 random samples from each city and including the `rate_d` field in the
tuples that are returned.
The `random` function is wrapped by a `sort` function which is sorting the tuples in
ascending order based on the `rate_d` field.
The next five variables (ratesA, ratesB, ratesC, ratesD, ratesE) contain the arrays of room rates for each
city. The `col` function is used to move the `rate_d` field from the random sample tuples
into an array for each city.
Now we have five sorted vectors of room rates that we can compare with our `distance` function.
After the variables are set the `let` expression runs the `top` expression.
The `top` expression wraps a `list` of tuples. Inside each tuple the `distance` function is used to compare
the `ratesA` vector with the vector for one of the other cities. The output of the `distance` function is stored in the `distance` field
in the tuple.
The `list` function emits each `tuple` and the `top` function returns only the tuple with the lowest distance.
View File
@ -269,31 +269,158 @@ if(gt(fieldA,fieldB),floor(fieldA),floor(fieldB)) // if fieldA > fieldB then ret
== sin
//TODO
The `sin` function returns the trigonometric sine of a number.
=== sin Parameters
* `Field Name | Raw Number | Number Evaluator`: The value to return the sine of.
=== sin Syntax
[source,text]
----
sin(100.4) // returns the sine of 100.4
sin(fieldA) // returns the sine for fieldA.
if(gt(fieldA,fieldB),sin(fieldA),sin(fieldB)) // if fieldA > fieldB then return the sine of fieldA, else return the sine of fieldB
----
== asin
The `asin` function returns the trigonometric arcsine of a number.
=== asin Parameters
* `Field Name | Raw Number | Number Evaluator`: The value to return the arcsine of.
=== asin Syntax
[source,text]
----
asin(100.4) // returns the arcsine of 100.4
asin(fieldA) // returns the arcsine for fieldA.
if(gt(fieldA,fieldB),asin(fieldA),asin(fieldB)) // if fieldA > fieldB then return the arcsine of fieldA, else return the arcsine of fieldB
----
== hsin
The `hsin` function returns the hyperbolic sine of a number.
=== hsin Parameters
* `Field Name | Raw Number | Number Evaluator`: The value to return the hyperbolic sine of.
=== hsin Syntax
[source,text]
----
hsin(100.4) // returns the hyperbolic sine of 100.4
hsin(fieldA) // returns the hyperbolic sine for fieldA.
if(gt(fieldA,fieldB),hsin(fieldA),hsin(fieldB)) // if fieldA > fieldB then return the hyperbolic sine of fieldA, else return the hyperbolic sine of fieldB
----
== sinh
//TODO
== cos
The `cos` function returns the trigonometric cosine of a number.
=== cos Parameters
* `Field Name | Raw Number | Number Evaluator`: The value to return the cosine of.
=== cos Syntax
[source,text]
----
cos(100.4) // returns the cosine of 100.4
cos(fieldA) // returns the cosine for fieldA.
if(gt(fieldA,fieldB),cos(fieldA),cos(fieldB)) // if fieldA > fieldB then return the cosine of fieldA, else return the cosine of fieldB
----
== acos
The `acos` function returns the trigonometric arccosine of a number.
=== acos Parameters
* `Field Name | Raw Number | Number Evaluator`: The value to return the arccosine of.
=== acos Syntax
[source,text]
----
acos(100.4) // returns the arccosine of 100.4
acos(fieldA) // returns the arccosine for fieldA.
if(gt(fieldA,fieldB),acos(fieldA),acos(fieldB)) // if fieldA > fieldB then return the arccosine of fieldA, else return the arccosine of fieldB
----
== atan
The `atan` function returns the trigonometric arctangent of a number.
=== atan Parameters
* `Field Name | Raw Number | Number Evaluator`: The value to return the arctangent of.
=== atan Syntax
[source,text]
----
atan(100.4) // returns the arctangent of 100.4
atan(fieldA) // returns the arctangent for fieldA.
if(gt(fieldA,fieldB),atan(fieldA),atan(fieldB)) // if fieldA > fieldB then return the arctangent of fieldA, else return the arctangent of fieldB
----
== round
The `round` function returns the closest whole number to the argument.
=== round Parameters
* `Field Name | Raw Number | Number Evaluator`: The value to round.
=== round Syntax
[source,text]
----
round(100.4) // returns the rounded value of 100.4
round(fieldA) // returns the rounded value of fieldA
if(gt(fieldA,fieldB),round(fieldA),round(fieldB)) // if fieldA > fieldB then return the rounded value of fieldA, else return the rounded value of fieldB
----
== sqrt
The `sqrt` function returns the square root of a number.
=== sqrt Parameters
* `Field Name | Raw Number | Number Evaluator`: The value to return the square root of.
=== sqrt Syntax
[source,text]
----
sqrt(100.4) // returns the square root of 100.4
sqrt(fieldA) // returns the square root for fieldA.
if(gt(fieldA,fieldB),sqrt(fieldA),sqrt(fieldB)) // if fieldA > fieldB then return the sqrt of fieldA, else return the sqrt of fieldB
----
== cbrt
The `cbrt` function returns the cube root of a number.
=== cbrt Parameters
* `Field Name | Raw Number | Number Evaluator`: The value to return the cube root of.
=== cbrt Syntax
[source,text]
----
cbrt(100.4) // returns the cube root of 100.4
cbrt(fieldA) // returns the cube root for fieldA.
if(gt(fieldA,fieldB),cbrt(fieldA),cbrt(fieldB)) // if fieldA > fieldB then return the cbrt of fieldA, else return the cbrt of fieldB
----
== and
The `and` function will return the logical AND of at least 2 boolean parameters. The function will fail to execute if any parameters are non-boolean or null. Returns a boolean value.
@ -530,3 +657,374 @@ raw(true) // "true" (note: this returns the string "true" and not the boolean tr
eq(raw(fieldA), fieldA) // true if the value of fieldA equals the string "fieldA"
----
== movingAvg
The `movingAvg` function calculates a moving average over an array of numbers.
(https://en.wikipedia.org/wiki/Moving_average)
=== movingAvg Parameters
* `numeric array`
* `window size`
=== movingAvg Returns
A numeric array containing the moving average. The array returned will be smaller than the
original array by the window size.
=== movingAvg Syntax
movingAvg(numericArray, 30)
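
Below is an illustrative sketch (the input values are made up) that combines `movingAvg` with the `array` and `let` functions, following the variable-binding pattern used in the room-rate example above:

[source,text]
----
let(a=array(10, 20, 30, 40, 50, 60, 70),  // illustrative numeric array
    tuple(avg=movingAvg(a, 3)))           // moving average with a window size of 3
----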
== anova
The `anova` function calculates the analysis of variance for two or more numeric arrays.
(https://en.wikipedia.org/wiki/Analysis_of_variance)
=== anova Parameters
* `numeric array` ... (two or more)
=== anova Returns
A tuple with the results of ANOVA
=== anova Syntax
anova(numericArray1, numericArray2) // calculates ANOVA for two numeric arrays
anova(numericArray1, numericArray2, numericArray3) // calculates ANOVA for three numeric arrays
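
As a sketch with made-up observations, `anova` can be applied to arrays bound in a `let` expression:

[source,text]
----
let(a=array(94, 99, 102, 105, 98, 101),
    b=array(120, 128, 115, 133, 125, 122),
    tuple(anovaResult=anova(a, b)))
----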
== hist
The `hist` function creates a histogram from a numeric array. The hist function is designed
to work with continuous variables.
=== hist Parameters
* `numeric array`
* `bins` (The number of bins in the histogram)
=== hist Returns
A list containing a tuple for each bin in the histogram. Each tuple contains
summary statistics for the observations that were within the bin.
=== hist Syntax
hist(numericArray, bins)
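
For example, a sketch that builds a histogram over a random sample, re-using the hypothetical `bookings` collection and `rate_d` field from the earlier room-rate example:

[source,text]
----
let(a=random(bookings, q="*:*", rows="1000", fl="rate_d"),
    b=col(a, rate_d),
    tuple(histogram=hist(b, 10)))   // 10 bins over the sampled room rates
----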
== array
The `array` function returns an array of numerics or other objects, including other arrays.
=== array Parameters
* `numeric` | `array` ...
=== array Returns
array
=== array Syntax
array(1, 2, 3) // Array of numerics
array(array(1,2,3), array(4,5,6)) // Array of arrays
== sequence
The `sequence` function returns an array of numbers based on its parameters.
=== sequence Parameters
* `length`
* `start`
* `stride`
=== sequence Returns
numeric array
=== sequence Syntax
sequence(100, 0, 1) // Returns a sequence of length 100, starting from 0 with a stride of 1.
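As a small illustrative sketch, the sequence below has a length of 5, starts at 10 and uses a stride of 2, so it would contain 10, 12, 14, 16, 18:

[source,text]
----
let(a=sequence(5, 10, 2),
    tuple(seq=a))
----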
== finddelay
The `finddelay` function performs a cross-correlation between two numeric arrays and returns the delay.
=== finddelay Parameters
* `numeric array`
* `numeric array`
=== finddelay Returns
Integer delay
=== finddelay Syntax
finddelay(numericArray1, numericArray2)
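
For example, a sketch with two made-up arrays where the second is a shifted copy of the first:

[source,text]
----
let(a=array(1, 2, 3, 4, 5, 0, 0, 0),
    b=array(0, 0, 0, 1, 2, 3, 4, 5),   // the same signal delayed by three positions
    tuple(delay=finddelay(a, b)))
----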
== describe
The `describe` function returns a tuple containing the descriptive statistics for an array.
=== describe Parameters
* `numeric array`
=== describe Returns
Tuple containing descriptive statistics
=== describe Syntax
describe(numericArray)
== copyOfRange
The `copyOfRange` function creates a copy of a range of a numeric array.
=== copyOfRange Parameters
* `numeric array`
* `start index`
* `end index`
=== copyOfRange Returns
A copy of the array starting from the start index (inclusive) and ending at the end index
(exclusive).
=== copyOfRange Syntax
copyOfRange(numericArray, startIndex, endIndex)
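
As a sketch with made-up values, the copy below starts at index 1 (inclusive) and ends at index 4 (exclusive), so it would contain 20, 30 and 40:

[source,text]
----
let(a=array(10, 20, 30, 40, 50),
    tuple(range=copyOfRange(a, 1, 4)))
----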
== copyOf
The `copyOf` function creates a copy of a numeric array.
=== copyOf Parameters
* `numeric array`
* `length`
=== copyOf Returns
A copy of the numeric array starting from index zero, with a size equal to the length parameter. The returned
array will be right-padded with zeros if the length parameter exceeds the size of the
original array.
=== copyOf Syntax
copyOf(numericArray, length)
== distance
The `distance` function calculates the Euclidean distance between two numeric arrays.
=== distance Parameters
* `numeric array`
* `numeric array`
=== distance Returns
number
=== distance Syntax
distance(numericArray1, numericArray2)
== scale
The `scale` function multiplies all the elements of an array by a number.
=== scale Parameters
* `number`
* `numeric array`
=== scale Returns
A numeric array with the scaled values
=== scale Syntax
scale(number, numericArray)
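
For example, a sketch that multiplies each element of a made-up array by 3, producing 4.5, 7.5 and 10.5:

[source,text]
----
let(a=array(1.5, 2.5, 3.5),
    tuple(scaled=scale(3, a)))
----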
== rank
The `rank` function performs a *rank transformation* on a numeric array.
=== rank Parameters
* `numeric array`
=== rank Returns
numeric array with rank transformed values
=== rank Syntax
rank(numericArray)
== length
The `length` function returns the length of a numeric array.
=== length Parameters
* `numeric array`
=== length Returns
integer
=== length Syntax
length(numericArray)
== rev
The `rev` function reverses the order of a numeric array.
=== rev Parameters
* `numeric array`
=== rev Returns
A copy of a numeric array with its elements reversed.
=== rev Syntax
rev(numericArray)
== regress
The `regress` function performs a simple regression on two numeric arrays.
=== regress Parameters
* `numeric array`
* `numeric array`
=== regress Returns
A regression result tuple which holds the regression results. The regression
result tuple is also used by the `predict` function.
=== regress Syntax
regress(numericArray1, numericArray2)
== predict
The `predict` function predicts the value of a dependent variable based on
the output of the `regress` function.
=== predict Parameters
* `regress output`
* `numeric predictor`
=== predict Returns
The predicted value
=== predict Syntax
predict(regressOutput, predictor)
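
Because `predict` consumes the output of `regress`, the two are typically combined in a single `let` expression. In the sketch below the made-up arrays follow the line b = 2 * a, so the predicted value for 6 would be 12:

[source,text]
----
let(a=array(1, 2, 3, 4, 5),
    b=array(2, 4, 6, 8, 10),
    r=regress(a, b),                   // fit a simple regression of b on a
    tuple(prediction=predict(r, 6)))   // predict b for an a value of 6
----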
== col
The `col` function returns a numeric array from a list of Tuples. The col
function is used to create numeric arrays from stream sources.
=== col Parameters
* `list of Tuples`
* `field name`
=== col Returns
A numeric array from a list of tuples. The field name parameter specifies which field
to create the array from.
=== col Syntax
col(tupleList, fieldName)
== cov
The `cov` function returns the covariance of two numeric arrays.
=== cov Parameters
* `numeric array`
* `numeric array`
=== cov Returns
Number
=== cov Syntax
cov(numericArray, numericArray)
== corr
The `corr` function returns the Pearson Product Moment Correlation of two numeric arrays.
=== corr Parameters
* `numeric array`
* `numeric array`
=== corr Returns
double between -1 and 1
=== corr Syntax
corr(numericArray1, numericArray2)
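
As a sketch, the made-up arrays below have a perfectly linear relationship, so the correlation would be 1:

[source,text]
----
let(a=array(1, 2, 3, 4, 5),
    b=array(10, 20, 30, 40, 50),
    tuple(correlation=corr(a, b)))
----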
== conv
The `conv` function returns the convolution (https://en.wikipedia.org/wiki/Convolution) of two numeric arrays.
=== conv Parameters
* `numeric array`
* `numeric array`
=== conv Returns
A numeric array with the convolution of the two array parameters.
=== conv Syntax
conv(numericArray1, numericArray2)
== normalize
The `normalize` function normalizes a numeric array so that values within the array
have a mean of 0 and standard deviation of 1.
=== normalize Parameters
* `numeric array`
=== normalize Returns
A numeric array with normalized values.
=== normalize Syntax
normalize(numericArray)
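
For example, a sketch that normalizes a made-up array so the resulting values have a mean of 0 and a standard deviation of 1:

[source,text]
----
let(a=array(10, 20, 30, 40, 50),
    tuple(normalized=normalize(a)))
----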

View File

@ -21,24 +21,19 @@
This page covers how to upgrade an existing Solr cluster that was installed using the <<taking-solr-to-production.adoc#taking-solr-to-production,service installation scripts>>.
[IMPORTANT]
====
The steps outlined on this page assume you use the default service name of "```solr```". If you use an alternate service name or Solr installation directory, some of the paths and commands mentioned below will have to be modified accordingly.
====
IMPORTANT: The steps outlined on this page assume you use the default service name of `solr`. If you use an alternate service name or Solr installation directory, some of the paths and commands mentioned below will have to be modified accordingly.
== Planning Your Upgrade
Here is a checklist of things you need to prepare before starting the upgrade process:
1. Examine the <<upgrading-solr.adoc#upgrading-solr,Upgrading Solr>> page to determine if any behavior changes in the new version of Solr will affect your installation.
2. If not using replication (ie: collections with replicationFactor > 1), then you should make a backup of each collection. If all of your collections use replication, then you don't technically need to make a backup since you will be upgrading and verifying each node individually.
3. Determine which Solr node is currently hosting the Overseer leader process in SolrCloud, as you should upgrade this node last. To determine the Overseer, use the Overseer Status API, see: <<collections-api.adoc#collections-api,Collections API>>.
4. Plan to perform your upgrade during a system maintenance window if possible. You'll be doing a rolling restart of your cluster (each node, one-by-one), but we still recommend doing the upgrade when system usage is minimal.
5. Verify the cluster is currently healthy and all replicas are active, as you should not perform an upgrade on a degraded cluster.
6. Re-build and test all custom server-side components against the new Solr JAR files.
7. Determine the values of the following variables that are used by the Solr Control Scripts:
. Examine the <<solr-upgrade-notes.adoc#solr-upgrade-notes,Solr Upgrade Notes>> page to determine if any behavior changes in the new version of Solr will affect your installation.
. If not using replication (ie: collections with replicationFactor > 1), then you should make a backup of each collection. If all of your collections use replication, then you don't technically need to make a backup since you will be upgrading and verifying each node individually.
. Determine which Solr node is currently hosting the Overseer leader process in SolrCloud, as you should upgrade this node last. To determine the Overseer, use the Overseer Status API, see: <<collections-api.adoc#collections-api,Collections API>>.
. Plan to perform your upgrade during a system maintenance window if possible. You'll be doing a rolling restart of your cluster (each node, one-by-one), but we still recommend doing the upgrade when system usage is minimal.
. Verify the cluster is currently healthy and all replicas are active, as you should not perform an upgrade on a degraded cluster.
. Re-build and test all custom server-side components against the new Solr JAR files.
. Determine the values of the following variables that are used by the Solr Control Scripts:
* `ZK_HOST`: The ZooKeeper connection string your current SolrCloud nodes use to connect to ZooKeeper; this value will be the same for all nodes in the cluster.
* `SOLR_HOST`: The hostname each Solr node used to register with ZooKeeper when joining the SolrCloud cluster; this value will be used to set the *host* Java system property when starting the new Solr process.
* `SOLR_PORT`: The port each Solr node is listening on, such as 8983.