= Running Solr on HDFS
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements.  See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership.  The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License.  You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied.  See the License for the
// specific language governing permissions and limitations
// under the License.

Solr supports writing its index and transaction log files to, and reading them from, the HDFS distributed filesystem.

This does not use Hadoop MapReduce to process Solr data; it only uses the HDFS filesystem for index and transaction log file storage.

To use HDFS rather than a local filesystem, you must be using Hadoop 2.x and you will need to instruct Solr to use the `HdfsDirectoryFactory`. There are also several additional parameters to define. These can be set in one of three ways:

* Pass JVM arguments to the `bin/solr` script. These would need to be passed every time you start Solr with `bin/solr`.
* Modify `solr.in.sh` (or `solr.in.cmd` on Windows) to pass the JVM arguments automatically when using `bin/solr` without having to set them manually.
* Define the properties in `solrconfig.xml`. These configuration changes need to be repeated for every collection, so this is a good option if you only want some of your collections stored in HDFS.

== Starting Solr on HDFS

=== Standalone Solr Instances

For standalone Solr instances, there are a few parameters you should modify before starting Solr. These can be set in `solrconfig.xml` (more on that <<HdfsDirectoryFactory Parameters,below>>), or passed to the `bin/solr` script at startup.

* You need to use an `HdfsDirectoryFactory` and a data directory in the form `hdfs://host:port/path`
* You need to specify an `updateLog` location in the form `hdfs://host:port/path`
* You should specify a lock factory type of `'hdfs'` or none.

If you do not modify `solrconfig.xml`, you can instead start Solr on HDFS with the following command:

[source,bash]
----
bin/solr start -Dsolr.directoryFactory=HdfsDirectoryFactory \
     -Dsolr.lock.type=hdfs \
     -Dsolr.data.dir=hdfs://host:port/path \
     -Dsolr.updatelog=hdfs://host:port/path
----

This example will start Solr in standalone mode, using the defined JVM properties (explained in more detail <<HdfsDirectoryFactory Parameters,below>>).

=== SolrCloud Instances

In SolrCloud mode, it's best to leave the data and update log directories as the defaults Solr comes with and simply specify the `solr.hdfs.home`. All dynamically created collections will create the appropriate directories automatically under the `solr.hdfs.home` root directory.

* Set `solr.hdfs.home` in the form `hdfs://host:port/path`
* You should specify a lock factory type of `'hdfs'` or none.

[source,bash]
----
bin/solr start -c -Dsolr.directoryFactory=HdfsDirectoryFactory \
     -Dsolr.lock.type=hdfs \
     -Dsolr.hdfs.home=hdfs://host:port/path
----

This command starts Solr in SolrCloud mode, using the defined JVM properties.


=== Modifying solr.in.sh (*nix) or solr.in.cmd (Windows)

The examples above assume you will pass JVM arguments as part of the start command every time you use `bin/solr` to start Solr. However, `bin/solr` looks for an include file named `solr.in.sh` (`solr.in.cmd` on Windows) to set environment variables. By default, this file is found in the `bin` directory, and you can modify it to permanently add the `HdfsDirectoryFactory` settings and ensure they are used every time Solr is started.

For example, to set JVM arguments to always use HDFS when running in SolrCloud mode (as shown above), you would add a section such as this:

[source,bash]
----
# Set HDFS DirectoryFactory & Settings
# Append the HDFS properties to SOLR_OPTS so bin/solr passes them to the JVM
SOLR_OPTS="$SOLR_OPTS -Dsolr.directoryFactory=HdfsDirectoryFactory \
-Dsolr.lock.type=hdfs \
-Dsolr.hdfs.home=hdfs://host:port/path"
----

== The Block Cache

For performance, the `HdfsDirectoryFactory` uses a Directory that will cache HDFS blocks. This caching mechanism replaces the standard file system cache that Solr utilizes. By default, this cache is allocated off-heap. This cache will often need to be quite large and you may need to raise the off-heap memory limit for the specific JVM you are running Solr in. For the Oracle/OpenJDK JVMs, the following is an example command-line parameter that you can use to raise the limit when starting Solr:

[source,bash]
----
-XX:MaxDirectMemorySize=20g
----
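
Because the cache is built from fixed-size slabs (128 MB each, as described under Block Cache Settings below), the direct memory limit can be estimated from the slab count. The following is a minimal sketch of that arithmetic, not a tuning recommendation: the 16 GB target and the 1 GB of headroom are assumed values chosen only for illustration, and passing the slab count as a JVM property assumes the startup approach described at the top of this page.

```bash
#!/usr/bin/env bash
# Sketch: derive a slab count and a direct memory limit from a target cache size.
SLAB_SIZE_MB=128        # slab size stated under Block Cache Settings
TARGET_CACHE_MB=16384   # hypothetical target: 16 GB of block cache
HEADROOM_MB=1024        # assumed headroom for other direct-memory users

# Round up so the cache is at least as large as the target.
SLAB_COUNT=$(( (TARGET_CACHE_MB + SLAB_SIZE_MB - 1) / SLAB_SIZE_MB ))
DIRECT_MEMORY_MB=$(( SLAB_COUNT * SLAB_SIZE_MB + HEADROOM_MB ))

# Print the two JVM arguments that result from the calculation.
echo "-Dsolr.hdfs.blockcache.slab.count=${SLAB_COUNT}"
echo "-XX:MaxDirectMemorySize=${DIRECT_MEMORY_MB}m"
```

With a global block cache, the slab count applies once per Solr node rather than once per core, so the estimate above is per JVM.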

== HdfsDirectoryFactory Parameters

The `HdfsDirectoryFactory` has a number of settings defined as part of the `directoryFactory` configuration.

=== Solr HDFS Settings

`solr.hdfs.home`::
A root location in HDFS for Solr to write collection data to. Rather than specifying an HDFS location for the data directory or update log directory, use this to specify one root location and have everything automatically created within this HDFS location. The structure of this parameter is `hdfs://host:port/path/solr`.

=== Block Cache Settings

`solr.hdfs.blockcache.enabled`::
Enable the blockcache. The default is `true`.

`solr.hdfs.blockcache.read.enabled`::
Enable the read cache. The default is `true`.

`solr.hdfs.blockcache.direct.memory.allocation`::
Enable direct memory allocation. If this is `false`, heap is used. The default is `true`.

`solr.hdfs.blockcache.slab.count`::
Number of memory slabs to allocate. Each slab is 128 MB in size. The default is `1`.

`solr.hdfs.blockcache.global`::
Enable/Disable using one global cache for all SolrCores. The settings used will be from the first HdfsDirectoryFactory created. The default is `true`.

=== NRTCachingDirectory Settings

`solr.hdfs.nrtcachingdirectory.enable`::
Enable the use of NRTCachingDirectory. The default is `true`.

`solr.hdfs.nrtcachingdirectory.maxmergesizemb`::
NRTCachingDirectory max segment size for merges. The default is `16`.

`solr.hdfs.nrtcachingdirectory.maxcachedmb`::
NRTCachingDirectory max cache size. The default is `192`.

=== HDFS Client Configuration Settings

`solr.hdfs.confdir`::
The location of the HDFS client configuration files. This is needed, for example, when HDFS runs in high-availability (HA) mode.
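
As an illustration of the HA case, the sketch below shows how this setting might be combined with the others in `solr.in.sh`. The values are assumptions, not defaults: `mycluster` stands in for the logical nameservice ID defined in your Hadoop configuration, and `/etc/hadoop/conf` stands in for the directory that holds your `hdfs-site.xml` and `core-site.xml`. With HA, the HDFS URI names the logical nameservice rather than a single `host:port`, which is why the client configuration files are needed to resolve it.

```bash
# Sketch for solr.in.sh: HDFS HA with a client configuration directory.
# "mycluster" and /etc/hadoop/conf are hypothetical placeholders.
SOLR_OPTS="$SOLR_OPTS -Dsolr.directoryFactory=HdfsDirectoryFactory \
-Dsolr.lock.type=hdfs \
-Dsolr.hdfs.home=hdfs://mycluster/solr \
-Dsolr.hdfs.confdir=/etc/hadoop/conf"
```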

=== Kerberos Authentication Settings

Hadoop can be configured to use the Kerberos protocol to verify user identity when trying to access core services like HDFS. If your HDFS directories are protected using Kerberos, then you need to configure Solr's HdfsDirectoryFactory to authenticate using Kerberos in order to read and write to HDFS. To enable Kerberos authentication from Solr, you need to set the following parameters:

`solr.hdfs.security.kerberos.enabled`::
Set to `true` to enable Kerberos authentication. The default is `false`.

`solr.hdfs.security.kerberos.keytabfile`::
The path to the keytab file. A keytab file contains pairs of Kerberos principals and encrypted keys, which allow for password-less authentication when Solr attempts to authenticate with secure Hadoop.
+
This file will need to be present on all Solr servers at the same path provided in this parameter.

`solr.hdfs.security.kerberos.principal`::
The Kerberos principal that Solr should use to authenticate to secure Hadoop; the format of a typical Kerberos V5 principal is: `primary/instance@realm`.

== Example solrconfig.xml for HDFS

Here is a sample `solrconfig.xml` configuration for storing Solr indexes on HDFS:

[source,xml]
----
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://host:port/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
  <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
  <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
  <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>
----

If you are using Kerberos, you will need to add the three Kerberos-related properties to the `<directoryFactory>` element in `solrconfig.xml`, such as:

[source,xml]
----
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
   ...
  <bool name="solr.hdfs.security.kerberos.enabled">true</bool>
  <str name="solr.hdfs.security.kerberos.keytabfile">/etc/krb5.keytab</str>
  <str name="solr.hdfs.security.kerberos.principal">solr/admin@KERBEROS.COM</str>
</directoryFactory>
----

// In Solr 8, this should be removed entirely;
// it's here now only for back-compat for existing users

== Automatically Add Replicas in SolrCloud

The ability to automatically add new replicas when the Overseer notices that a shard has gone down was previously only available to users running Solr in HDFS, but it is now available to all users via Solr's autoscaling framework. See the section <<solrcloud-autoscaling-auto-add-replicas.adoc#the-autoaddreplicas-parameter,SolrCloud Autoscaling Automatically Adding Replicas>> for details on how to enable and disable this feature.

[WARNING]
====
The ability to enable or disable the autoAddReplicas feature with cluster properties has been deprecated and will be removed in a future version. All users of this feature who have previously used that approach are encouraged to change their configurations to use the autoscaling framework to ensure continued operation of this feature in their Solr installations.

If you use this feature with the deprecated configuration, you can temporarily disable it cluster-wide by setting the cluster property `autoAddReplicas` to `false`, as in these examples:

.V1 API
[source,bash]
----
curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERPROP&name=autoAddReplicas&val=false'
----

.V2 API
[source,bash]
----
curl -X POST -H 'Content-type: application/json' -d '{"set-property": {"name":"autoAddReplicas", "val":false}}' http://localhost:8983/api/cluster
----

Re-enable the feature by unsetting the `autoAddReplicas` cluster property. When no `val` parameter is provided, the cluster property is unset:

.V1 API
[source,bash]
----
curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERPROP&name=autoAddReplicas'
----

.V2 API
[source,bash]
----
curl -X POST -H 'Content-type: application/json' -d '{"set-property": {"name":"autoAddReplicas"}}' http://localhost:8983/api/cluster
----
====