lucene/solr/docker/Docker-FAQ.md

14 KiB

Docker Solr FAQ

How do I persist Solr data and config?

Your data is persisted already, in your container's filesystem. If you docker run, add data to Solr, then docker stop and later docker start, then your data is still there. The same is true for changes to configuration files.

Equally, if you docker commit your container, you can later create a new container from that image, and that will have your data in it.

For some use-cases it is convenient to provide a modified solr.in.sh file to Solr. For example to point Solr to a ZooKeeper host:

docker create --name my_solr -P solr
docker cp my_solr:/opt/solr/bin/solr.in.sh .
sed -i -e 's/#ZK_HOST=.*/ZK_HOST=cylon.lan:2181/' solr.in.sh
docker cp solr.in.sh my_solr:/opt/solr/bin/solr.in.sh
docker start my_solr
# With a browser go to http://cylon.lan:32873/solr/#/ and confirm "-DzkHost=cylon.lan:2181" in the JVM Args section.

But usually when people ask this question, what they are after is a way to store Solr data and config in a separate Docker Volume. That is explained in the next two questions.

How can I mount a host directory as a data volume?

This is useful if you want to inspect or modify the data in the Docker host when the container is not running, and later easily run new containers against that data. This is indeed possible, but there are a few gotchas.

Solr stores its core data in the server/solr directory, in sub-directories for each core. The server/solr directory also contains configuration files that are part of the Solr distribution. Now, if we mounted volumes for each core individually, then that would interfere with Solr trying to create those directories. If instead we make the whole directory a volume, then we need to provide those configuration files in our volume, which we can do by copying them from a temporary container. For example:

# create a directory to store the server/solr directory
$ mkdir /home/docker-volumes/mysolr1

# make sure its host owner matches the container's solr user
$ sudo chown 8983:8983 /home/docker-volumes/mysolr1

# copy the solr directory from a temporary container to the volume
$ docker run -it --rm -v /home/docker-volumes/mysolr1:/target apache/solr cp -r server/solr /target/

# pass the solr directory to a new container running solr
$ SOLR_CONTAINER=$(docker run -d -P -v /home/docker-volumes/mysolr1/solr:/opt/solr/server/solr apache/solr)

# create a new core
$ docker exec -it --user=solr $SOLR_CONTAINER solr create_core -c gettingstarted

# check the volume on the host:
$ ls /home/docker-volumes/mysolr1/solr/
configsets  gettingstarted  README.txt  solr.xml  zoo.cfg

Note that if you add or modify files in that directory from the host, you must chown 8983:8983 them.

How can I use a Data Volume Container?

You can avoid the concerns about UID mismatches above, by using data volumes only from containers. You can create a container with a volume, then point future containers at that same volume. This can be handy if you want to modify the solr image, for example if you want to add a program. By separating the data and the code, you can change the code and re-use the data.

But there are pitfalls:

  • if you remove the container that owns the volume, then you lose your data. Docker does not even warn you that a running container is dependent on it.
  • if you point multiple solr containers at the same volume, you will have multiple instances write to the same files, which will undoubtedly lead to corruption
  • if you do want to remove that volume, you must do docker rm -v containername; if you forget the -v there will be a dangling volume which you can not easily clean up.

Here is an example:

# create a container with a volume on the path that solr uses to store data.
docker create -v /opt/solr/server/solr --name mysolr1data solr /bin/true

# pass the volume to a new container running solr
SOLR_CONTAINER=$(docker run -d -P --volumes-from=mysolr1data apache/solr)

# create a new core
$ docker exec -it --user=solr $SOLR_CONTAINER solr create_core -c gettingstarted

# make a change to the config, using the config API
docker exec -it --user=solr $SOLR_CONTAINER curl http://localhost:8983/solr/gettingstarted/config -H 'Content-type:application/json' -d'{
    "set-property" : {"query.filterCache.autowarmCount":1000},
    "unset-property" :"query.filterCache.size"}'

# verify the change took effect
docker exec -it --user=solr $SOLR_CONTAINER curl http://localhost:8983/solr/gettingstarted/config/overlay?omitHeader=true

# stop the solr container
docker exec -it --user=solr $SOLR_CONTAINER bash -c 'cd server; java -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -jar start.jar --stop'

# create a new container
SOLR_CONTAINER=$(docker run -d -P --volumes-from=mysolr1data apache/solr)

# check our core is still there:
docker exec -it --user=solr $SOLR_CONTAINER ls server/solr/gettingstarted

# check the config modification is still there:
docker exec -it --user=solr $SOLR_CONTAINER curl http://localhost:8983/solr/gettingstarted/config/overlay?omitHeader=true

Can I use volumes with SOLR_HOME?

Solr supports a SOLR_HOME environment variable to point to a non-standard location of the Solr home directory. You can use this in Solr docker, in combination with volumes:

docker run -it -v $PWD/mysolrhome:/mysolrhome -e SOLR_HOME=/mysolrhome apache/solr

This does need a pre-configured directory at that location.

To make this easier, Solr docker supports a INIT_SOLR_HOME setting, which copies the contents from the default directory in the image to the SOLR_HOME (if it is empty).

mkdir mysolrhome
sudo chown 8983:8983 mysolrhome
docker run -it -v $PWD/mysolrhome:/mysolrhome -e SOLR_HOME=/mysolrhome -e INIT_SOLR_HOME=yes apache/solr

Note: If SOLR_HOME is set, the "solr-precreate" command will put the created core in the SOLR_HOME directory rather than the "mycores" directory.

Can I run ZooKeeper and Solr clusters under Docker?

At the network level the ZooKeeper nodes need to be able to talk to each other, and the Solr nodes need to be able to talk to the ZooKeeper nodes and to each other. At the application level, different nodes need to be able to identify and locate each other. In ZooKeeper that is done with a configuration file that lists hostnames or IP addresses for each node. In Solr that is done with a parameter that specifies a host or IP address, which is then stored in ZooKeeper.

In typical clusters, those hostnames/IP addresses are pre-defined and remain static through the lifetime of the cluster. In Docker, inter-container communication and multi-host networking can be facilitated by Docker Networks. But, crucially, Docker does not normally guarantee that IP addresses of containers remain static during the lifetime of a container. In non-networked Docker, the IP address seems to change every time you stop/start. In a networked Docker, containers can lose their IP address in certain sequences of starting/stopping, unless you take steps to prevent that.

IP changes cause problems:

  • If you use hardcoded IP addresses in configuration, and the addresses of your containers change after a stops/start, then your cluster will stop working and may corrupt itself.
  • If you use hostnames in configuration, and the addresses of your containers change, then you might run into problems with cached hostname lookups.
  • And if you use hostnames there is another problem: the names are not defined until the respective container is running, So when for example the first ZooKeeper node starts up, it will attempt a hostname lookup for the other nodes, and that will fail. This is especially a problem for ZooKeeper 3.4.6; future versions are better at recovering.

Docker 1.10 has a new --ip configuration option that allows you to specify an IP address for a container. It also has a --ip-range option that allows you to specify the range that other containers get addresses from. Used together, you can implement static addresses. See this example.

Docker's Legacy container links provide a way to pass connection configuration between containers. It only works on a single machine, on the default bridge. It provides no facilities for static IPs. Note: this feature is expected to be deprecated and removed in a future release. So really, see the "Can I run ZooKeeper and Solr clusters under Docker?" option above instead.

But for some use-cases, such as quick demos or one-shot automated testing, it can be convenient.

Run ZooKeeper, and define a name so we can link to it:

$ docker run --name zookeeper -d -p 2181:2181 -p 2888:2888 -p 3888:3888 jplock/zookeeper

Run two Solr nodes, linked to the zookeeper container:

$ docker run --name solr1 --link zookeeper:ZK -d -p 8983:8983 \
      apache/solr \
      bash -c 'solr start -f -z $ZK_PORT_2181_TCP_ADDR:$ZK_PORT_2181_TCP_PORT'

$ docker run --name solr2 --link zookeeper:ZK -d -p 8984:8983 \
      apache/solr \
      bash -c 'solr start -f -z $ZK_PORT_2181_TCP_ADDR:$ZK_PORT_2181_TCP_PORT'

Create a collection:

$ docker exec -i -t solr1 solr create_collection \
        -c gettingstarted -shards 2 -p 8983

Then go to http://localhost:8983/solr/#/~cloud (adjust the hostname for your docker host) to see the two shards and Solr nodes.

How can I run ZooKeeper and Solr with Docker Compose?

See the docker compose example.

How can I get rid of "shared memory" warnings on Solr startup?

When starting the docker image you typically see these log lines:

OpenJDK 64-Bit Server VM warning: Failed to reserve shared memory. (error = 1)

If your set up can run without huge pages or you do not require it, the least-friction way to remove this warning is to disable large paging in the JVM via the environment variable:

SOLR_OPTS=-XX:-UseLargePages

In your Solr Admin UI, you will see listed under the JVM args both the original -XX:+UseLargePages set by the GC_TUNE environment variable and further down the list the overriding -XX:-UseLargePages argument.

I'm confused about the different invocations of solr -- help?

The different invocations of the Solr docker image can look confusing, because the name of the image is "apache/solr" and the Solr command is also "solr", and the image interprets various arguments in special ways. I'll illustrate the various invocations:

To run an arbitrary command in the image:

docker run -it apache/solr date

Here "apache/solr" is the name of the image, and "date" is the command. This does not invoke any Solr functionality.

To run the Solr server:

docker run -it apache/solr

Here "apache/solr" is the name of the image, and there is no specific command, so the image defaults to run the "solr" command with "-f" to run it in the foreground.

To run the Solr server with extra arguments:

docker run -it apache/solr -h myhostname

This is the same as the previous one, but an additional argument is passed. The image will run the "solr" command with "-f -h myhostname".

To run solr as an arbitrary command:

docker run -it apache/solr solr zk --help

Here the first "apache/solr" is the image name, and the second "solr" is the "solr" command. The image runs the command exactly as specified; no "-f" is implicitly added. The container will print help text, and exit.

If you find this visually confusing, it might be helpful to use more specific image tags, and specific command paths. For example:

docker run -it apache/solr:6 bin/solr -f -h myhostname

Finally, the Solr docker image offers several commands that do some work before then invoking the Solr server, like "solr-precreate" and "solr-demo". See the README.md for usage. These are implemented by the docker-entrypoint.sh script, and must be passed as the first argument to the image. For example:

docker run -it apache/solr:6 solr-demo

It's important to understand an implementation detail here. The Dockerfile uses solr-foreground as the CMD, and the docker-entrypoint.sh implements that by by running "solr -f". So these two are equivalent:

docker run -it apache/solr:6
docker run -it apache/solr:6 solr-foreground

whereas:

docker run -it apache/solr:6 solr -f

is slightly different: the "solr" there is a generic command, not treated in any special way by docker-entrypoint.sh. In particular, this means that the docker-entrypoint-initdb.d mechanism is not applied. So, if you want to use docker-entrypoint-initdb.d, then you must use one of the other two invocations. You also need to keep that in mind when you want to invoke solr from the bash command. For example, this does NOT run docker-entrypoint-initdb.d scripts:

docker run -it -v $PWD/set-heap.sh:/docker-entrypoint-initdb.d/set-heap.sh \
    apache/solr:6 bash -c "echo hello; solr -f"

but this does:

docker run -it $PWD/set-heap.sh:/docker-entrypoint-initdb.d/set-heap.sh \
    apache/solr:6 bash -c "echo hello; /opt/docker-solr/scripts/docker-entrypoint.sh solr-foreground"