YARN-8852. Add documentation for submarine installation details. (Zac Zhou via wangda)

Change-Id: If5681d1ef37ff5dc916735eeef15a6120173d653
(cherry picked from commit a23ea68b97)
This commit is contained in:
Wangda Tan 2018-10-09 10:18:00 -07:00
parent f6a73d181f
commit 8e8b748724
3 changed files with 1521 additions and 0 deletions

View File

@ -40,3 +40,7 @@ Click below contents if you want to understand more.
- [How to write Dockerfile for Submarine jobs](WriteDockerfile.html)
- [Developer guide](DeveloperGuide.html)
- [Installation guide](InstallationGuide.html)
- [Installation guide Chinese version](InstallationGuideChineseVersion.html)

View File

@ -0,0 +1,760 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
# Submarine Installation Guide
## Prerequisites
### Operating System
The operating system and kernel versions we used are as shown in the following table, which should be minimum required versions:
| Enviroment | Verion |
| ------ | ------ |
| Operating System | centos-release-7-3.1611.el7.centos.x86_64 |
| Kernal | 3.10.0-514.el7.x86_64 |
### User & Group
As there are some specific users and groups need to be created to install hadoop/docker. Please create them if they are missing.
```
adduser hdfs
adduser mapred
adduser yarn
addgroup hadoop
usermod -aG hdfs,hadoop hdfs
usermod -aG mapred,hadoop mapred
usermod -aG yarn,hadoop yarn
usermod -aG hdfs,hadoop hadoop
groupadd docker
usermod -aG docker yarn
usermod -aG docker hadoop
```
### GCC Version
Check the version of GCC tool
```bash
gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
# install if needed
yum install gcc make g++
```
### Kernel header & Kernel devel
```bash
# Approach 1
yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
# Approach 2
wget http://vault.centos.org/7.3.1611/os/x86_64/Packages/kernel-headers-3.10.0-514.el7.x86_64.rpm
rpm -ivh kernel-headers-3.10.0-514.el7.x86_64.rpm
```
### GPU Servers
```
lspci | grep -i nvidia
# If the server has gpus, you can get info like this
04:00.0 3D controller: NVIDIA Corporation Device 1b38 (rev a1)
82:00.0 3D controller: NVIDIA Corporation Device 1b38 (rev a1)
```
### Nvidia Driver Installation
If nvidia driver/cuda has been installed before, They should be uninstalled firstly.
```
# uninstall cuda
sudo /usr/local/cuda-10.0/bin/uninstall_cuda_10.0.pl
# uninstall nvidia-driver
sudo /usr/bin/nvidia-uninstall
```
To check GPU version, install nvidia-detect
```
yum install nvidia-detect
# run 'nvidia-detect -v' to get reqired nvidia driver version
nvidia-detect -v
Probing for supported NVIDIA devices...
[10de:13bb] NVIDIA Corporation GM107GL [Quadro K620]
This device requires the current 390.87 NVIDIA driver kmod-nvidia
[8086:1912] Intel Corporation HD Graphics 530
An Intel display controller was also detected
```
Pay attention to `This device requires the current 390.87 NVIDIA driver kmod-nvidia`.
Download the installer [NVIDIA-Linux-x86_64-390.87.run](https://www.nvidia.com/object/linux-amd64-display-archive.html).
Some preparatory work for nvidia driver installation
```
# It may take a while to update
yum -y update
yum -y install kernel-devel
yum -y install epel-release
yum -y install dkms
# Disable nouveau
vim /etc/default/grub
# Add the following configuration in “GRUB_CMDLINE_LINUX” part
rd.driver.blacklist=nouveau nouveau.modeset=0
# Generate configuration
grub2-mkconfig -o /boot/grub2/grub.cfg
vim /etc/modprobe.d/blacklist.conf
# Add confiuration:
blacklist nouveau
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
dracut /boot/initramfs-$(uname -r).img $(uname -r)
reboot
```
Check whether nouveau is disabled
```
lsmod | grep nouveau # return null
# install nvidia driver
sh NVIDIA-Linux-x86_64-390.87.run
```
Some options during the installation
```
Install NVIDIA's 32-bit compatibility libraries (Yes)
centos Install NVIDIA's 32-bit compatibility libraries (Yes)
Would you like to run the nvidia-xconfig utility to automatically update your X configuration file... (NO)
```
Check nvidia driver installation
```
nvidia-smi
```
Reference
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
### Docker Installation
```
yum -y update
yum -y install yum-utils
yum-config-manager --add-repo https://yum.dockerproject.org/repo/main/centos/7
yum -y update
# Show available packages
yum search --showduplicates docker-engine
# Install docker 1.12.5
yum -y --nogpgcheck install docker-engine-1.12.5*
systemctl start docker
chown hadoop:netease /var/run/docker.sock
chown hadoop:netease /usr/bin/docker
```
Referencehttps://docs.docker.com/cs-engine/1.12/
### Docker Configuration
Add a file, named daemon.json, under the path of /etc/docker/. Please replace the variables of image_registry_ip, etcd_host_ip, localhost_ip, yarn_dns_registry_host_ip, dns_host_ip with specific ips according to your environments.
```
{
"insecure-registries": ["${image_registry_ip}:5000"],
"cluster-store":"etcd://${etcd_host_ip1}:2379,${etcd_host_ip2}:2379,${etcd_host_ip3}:2379",
"cluster-advertise":"{localhost_ip}:2375",
"dns": ["${yarn_dns_registry_host_ip}", "${dns_host_ip1}"],
"hosts": ["tcp://{localhost_ip}:2375", "unix:///var/run/docker.sock"]
}
```
Restart docker daemon
```
sudo systemctl restart docker
```
### Docker EE version
```bash
$ docker version
Client:
Version: 1.12.5
API version: 1.24
Go version: go1.6.4
Git commit: 7392c3b
Built: Fri Dec 16 02:23:59 2016
OS/Arch: linux/amd64
Server:
Version: 1.12.5
API version: 1.24
Go version: go1.6.4
Git commit: 7392c3b
Built: Fri Dec 16 02:23:59 2016
OS/Arch: linux/amd64
```
### Nvidia-docker Installation
Submarine is based on nvidia-docker 1.0 version
```
wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker-1.0.1-1.x86_64.rpm
sudo rpm -i /tmp/nvidia-docker*.rpm
# Start nvidia-docker
sudo systemctl start nvidia-docker
# Check nvidia-docker status
systemctl status nvidia-docker
# Check nvidia-docker log
journalctl -u nvidia-docker
# Test nvidia-docker-plugin
curl http://localhost:3476/v1.0/docker/cli
```
According to `nvidia-driver` version, add folders under the path of `/var/lib/nvidia-docker/volumes/nvidia_driver/`
```
mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87
# 390.8 is nvidia driver version
mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/bin
mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
cp /usr/bin/nvidia* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/bin
cp /usr/lib64/libcuda* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
cp /usr/lib64/libnvidia* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
# Test with nvidia-smi
nvidia-docker run --rm nvidia/cuda:9.0-devel nvidia-smi
```
Test docker, nvidia-docker, nvidia-driver installation
```
# Test 1
nvidia-docker run -rm nvidia/cuda nvidia-smi
```
```
# Test 2
nvidia-docker run -it tensorflow/tensorflow:1.9.0-gpu bash
# In docker container
python
import tensorflow as tf
tf.test.is_gpu_available()
```
[The way to uninstall nvidia-docker 1.0](https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0))
Reference:
https://github.com/NVIDIA/nvidia-docker/tree/1.0
### Tensorflow Image
There is no need to install CUDNN and CUDA on the servers, because CUDNN and CUDA can be added in the docker images. we can get basic docker images by following WriteDockerfile.md.
The basic Dockerfile doesn't support kerberos security. if you need kerberos, you can get write a Dockerfile like this
```shell
FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
# Pick up some TF dependencies
RUN apt-get update && apt-get install -y --allow-downgrades --no-install-recommends \
build-essential \
cuda-command-line-tools-9-0 \
cuda-cublas-9-0 \
cuda-cufft-9-0 \
cuda-curand-9-0 \
cuda-cusolver-9-0 \
cuda-cusparse-9-0 \
curl \
libcudnn7=7.0.5.15-1+cuda9.0 \
libfreetype6-dev \
libpng12-dev \
libzmq3-dev \
pkg-config \
python \
python-dev \
rsync \
software-properties-common \
unzip \
&& \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN export DEBIAN_FRONTEND=noninteractive && apt-get update && apt-get install -yq krb5-user libpam-krb5 && apt-get clean
RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py
RUN pip --no-cache-dir install \
Pillow \
h5py \
ipykernel \
jupyter \
matplotlib \
numpy \
pandas \
scipy \
sklearn \
&& \
python -m ipykernel.kernelspec
# Install TensorFlow GPU version.
RUN pip --no-cache-dir install \
http://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.8.0-cp27-none-linux_x86_64.whl
RUN apt-get update && apt-get install git -y
RUN apt-get update && apt-get install -y openjdk-8-jdk wget
# Downloadhadoop-3.1.1.tar.gz
RUN wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
RUN tar zxf hadoop-3.1.1.tar.gz
RUN mv hadoop-3.1.1 hadoop-3.1.0
# Download jdk which supports kerberos
RUN wget -qO jdk8.tar.gz 'http://${kerberos_jdk_url}/jdk-8u152-linux-x64.tar.gz'
RUN tar xzf jdk8.tar.gz -C /opt
RUN mv /opt/jdk* /opt/java
RUN rm jdk8.tar.gz
RUN update-alternatives --install /usr/bin/java java /opt/java/bin/java 100
RUN update-alternatives --install /usr/bin/javac javac /opt/java/bin/javac 100
ENV JAVA_HOME /opt/java
ENV PATH $PATH:$JAVA_HOME/bin
```
### Test tensorflow in a docker container
After docker image is built, we can check
tensorflow environments before submitting a yarn job.
```shell
$ docker run -it ${docker_image_name} /bin/bash
# >>> In the docker container
$ python
$ python >> import tensorflow as tf
$ python >> tf.__version__
```
If there are some errors, we could check the following configuration.
1. LD_LIBRARY_PATH environment variable
```
echo $LD_LIBRARY_PATH
/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
```
2. The location of libcuda.so.1, libcuda.so
```
ls -l /usr/local/nvidia/lib64 | grep libcuda.so
```
### Etcd Installation
To install Etcd on specified servers, we can run Submarine/install.sh
```shell
$ ./Submarine/install.sh
# Etcd status
systemctl status Etcd.service
```
Check Etcd cluster health
```shell
$ etcdctl cluster-health
member 3adf2673436aa824 is healthy: got healthy result from http://${etcd_host_ip1}:2379
member 85ffe9aafb7745cc is healthy: got healthy result from http://${etcd_host_ip2}:2379
member b3d05464c356441a is healthy: got healthy result from http://${etcd_host_ip3}:2379
cluster is healthy
$ etcdctl member list
3adf2673436aa824: name=etcdnode3 peerURLs=http://${etcd_host_ip1}:2380 clientURLs=http://${etcd_host_ip1}:2379 isLeader=false
85ffe9aafb7745cc: name=etcdnode2 peerURLs=http://${etcd_host_ip2}:2380 clientURLs=http://${etcd_host_ip2}:2379 isLeader=false
b3d05464c356441a: name=etcdnode1 peerURLs=http://${etcd_host_ip3}:2380 clientURLs=http://${etcd_host_ip3}:2379 isLeader=true
```
### Calico Installation
To install Calico on specified servers, we can run Submarine/install.sh
```
systemctl start calico-node.service
systemctl status calico-node.service
```
#### Check Calico Network
```shell
# Run the following command to show the all host status in the cluster except localhost.
$ calicoctl node status
Calico process is running.
IPv4 BGP status
+---------------+-------------------+-------+------------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+---------------+-------------------+-------+------------+-------------+
| ${host_ip1} | node-to-node mesh | up | 2018-09-21 | Established |
| ${host_ip2} | node-to-node mesh | up | 2018-09-21 | Established |
| ${host_ip3} | node-to-node mesh | up | 2018-09-21 | Established |
+---------------+-------------------+-------+------------+-------------+
IPv6 BGP status
No IPv6 peers found.
```
Create containers to validate calico network
```
docker network create --driver calico --ipam-driver calico-ipam calico-network
docker run --net calico-network --name workload-A -tid busybox
docker run --net calico-network --name workload-B -tid busybox
docker exec workload-A ping workload-B
```
## Hadoop Installation
### Compile hadoop source code
```
mvn package -Pdist -DskipTests -Dtar
```
### Start yarn service
```
YARN_LOGFILE=resourcemanager.log ./sbin/yarn-daemon.sh start resourcemanager
YARN_LOGFILE=nodemanager.log ./sbin/yarn-daemon.sh start nodemanager
YARN_LOGFILE=timeline.log ./sbin/yarn-daemon.sh start timelineserver
YARN_LOGFILE=mr-historyserver.log ./sbin/mr-jobhistory-daemon.sh start historyserver
```
### Start yarn registery dns service
```
sudo YARN_LOGFILE=registrydns.log ./yarn-daemon.sh start registrydns
```
### Test with a MR wordcount job
```
./bin/hadoop jar /home/hadoop/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0-SNAPSHOT.jar wordcount /tmp/wordcount.txt /tmp/wordcount-output4
```
## Tensorflow Job with CPU
### Standalone Mode
#### Clean up apps with the same name
Suppose we want to submit a tensorflow job named standalone-tf, destroy any application with the same name and clean up historical job directories.
```bash
./bin/yarn app -destroy standalone-tf
./bin/hdfs dfs -rmr hdfs://${dfs_name_service}/tmp/cifar-10-jobdir
```
where ${dfs_name_service} is the hdfs name service you use
#### Run a standalone tensorflow job
```bash
./bin/yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
--env DOCKER_JAVA_HOME=/opt/java \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name standalone-tf \
--docker_image dockerfile-cpu-tf1.8.0-with-models \
--input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
--checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-checkpoint \
--worker_resources memory=4G,vcores=2 --verbose \
--worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --num-gpus=0"
```
### Distributed Mode
#### Clean up apps with the same name
```bash
./bin/yarn app -destroy distributed-tf
./bin/hdfs dfs -rmr hdfs://${dfs_name_service}/tmp/cifar-10-jobdir
```
#### Run a distributed tensorflow job
```bash
./bin/yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
--env DOCKER_JAVA_HOME=/opt/java \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf \
--env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
--docker_image dockerfile-cpu-tf1.8.0-with-models \
--input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
--checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-distributed-checkpoint \
--worker_resources memory=4G,vcores=2 --verbose \
--num_ps 1 \
--ps_resources memory=4G,vcores=2 \
--ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --num-gpus=0" \
--num_workers 4 \
--worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=0"
```
## Tensorflow Job with GPU
### GPU configurations for both resourcemanager and nodemanager
Add the yarn resource configuration file, named resource-types.xml
```
<configuration>
<property>
<name>yarn.resource-types</name>
<value>yarn.io/gpu</value>
</property>
</configuration>
```
#### GPU configurations for resourcemanager
The scheduler used by resourcemanager must be capacity scheduler, and yarn.scheduler.capacity.resource-calculator in capacity-scheduler.xml should be DominantResourceCalculator
```
<configuration>
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
</configuration>
```
#### GPU configurations for nodemanager
Add configurations in yarn-site.xml
```
<configuration>
<property>
<name>yarn.nodemanager.resource-plugins</name>
<value>yarn.io/gpu</value>
</property>
</configuration>
```
Add configurations in container-executor.cfg
```
[docker]
...
# Add configurations in `[docker]` part
# /usr/bin/nvidia-docker is the path of nvidia-docker command
# nvidia_driver_375.26 means that nvidia driver version is 375.26. nvidia-smi command can be used to check the version
docker.allowed.volume-drivers=/usr/bin/nvidia-docker
docker.allowed.devices=/dev/nvidiactl,/dev/nvidia-uvm,/dev/nvidia-uvm-tools,/dev/nvidia1,/dev/nvidia0
docker.allowed.ro-mounts=nvidia_driver_375.26
[gpu]
module.enabled=true
[cgroups]
# /sys/fs/cgroup is the cgroup mount destination
# /hadoop-yarn is the path yarn creates by default
root=/sys/fs/cgroup
yarn-hierarchy=/hadoop-yarn
```
#### Test with a tensorflow job
Distributed-shell + GPU + cgroup
```bash
./yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
--env DOCKER_JAVA_HOME=/opt/java \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
--env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
--docker_image gpu-cuda9.0-tf1.8.0-with-models \
--input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
--checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-distributed-checkpoint \
--num_ps 0 \
--ps_resources memory=4G,vcores=2,gpu=0 \
--ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --num-gpus=0" \
--worker_resources memory=4G,vcores=2,gpu=1 --verbose \
--num_workers 1 \
--worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"
```
## Issues:
### Issue 1: Fail to start nodemanager after system reboot
```
2018-09-20 18:54:39,785 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to bootstrap configured resource subsystems!
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Unexpected: Cannot create yarn cgroup Subsystem:cpu Mount points:/proc/mounts User:yarn Path:/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializePreMountedCGroupController(CGroupsHandlerImpl.java:425)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializeCGroupController(CGroupsHandlerImpl.java:377)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:98)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:87)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:320)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:389)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997)
2018-09-20 18:54:39,789 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED
```
Solution: Grant user yarn the access to `/sys/fs/cgroup/cpu,cpuacct`, which is the subfolder of cgroup mount destination.
```
chown :yarn -R /sys/fs/cgroup/cpu,cpuacct
chmod g+rwx -R /sys/fs/cgroup/cpu,cpuacct
```
If GPUs are usedthe access to cgroup devices folder is neede as well
```
chown :yarn -R /sys/fs/cgroup/devices
chmod g+rwx -R /sys/fs/cgroup/devices
```
### Issue 2: container-executor permission denied
```
2018-09-21 09:36:26,102 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: IOException executing command:
java.io.IOException: Cannot run program "/etc/yarn/sbin/Linux-amd64-64/container-executor": error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:938)
at org.apache.hadoop.util.Shell.run(Shell.java:901)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
```
Solution: The permission of `/etc/yarn/sbin/Linux-amd64-64/container-executor` should be 6050
### Issue 3How to get docker service log
Solution: we can get docker log with the following command
```
journalctl -u docker
```
### Issue 4docker can't remove containers with errors like `device or resource busy`
```bash
$ docker rm 0bfafa146431
Error response from daemon: Unable to remove filesystem for 0bfafa146431771f6024dcb9775ef47f170edb2f1852f71916ba44209ca6120a: remove /app/docker/containers/0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a/shm: device or resource busy
```
Solution: to find which process leads to a `device or resource busy`, we can add a shell script, named `find-busy-mnt.sh`
```bash
#!/bin/bash
# A simple script to get information about mount points and pids and their
# mount namespaces.
if [ $# -ne 1 ];then
echo "Usage: $0 <devicemapper-device-id>"
exit 1
fi
ID=$1
MOUNTS=`find /proc/*/mounts | xargs grep $ID 2>/dev/null`
[ -z "$MOUNTS" ] && echo "No pids found" && exit 0
printf "PID\tNAME\t\tMNTNS\n"
echo "$MOUNTS" | while read LINE; do
PID=`echo $LINE | cut -d ":" -f1 | cut -d "/" -f3`
# Ignore self and thread-self
if [ "$PID" == "self" ] || [ "$PID" == "thread-self" ]; then
continue
fi
NAME=`ps -q $PID -o comm=`
MNTNS=`readlink /proc/$PID/ns/mnt`
printf "%s\t%s\t\t%s\n" "$PID" "$NAME" "$MNTNS"
done
```
Kill the process by pid, which is found by the script
```bash
$ chmod +x find-busy-mnt.sh
./find-busy-mnt.sh 0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a
# PID NAME MNTNS
# 5007 ntpd mnt:[4026533598]
$ kill -9 5007
```
### Issue 5Failed to execute `sudo nvidia-docker run`
```
docker: Error response from daemon: create nvidia_driver_361.42: VolumeDriver.Create: internal error, check logs for details.
See 'docker run --help'.
```
Solution:
```
#check nvidia-docker status
$ systemctl status nvidia-docker
$ journalctl -n -u nvidia-docker
#restart nvidia-docker
systemctl stop nvidia-docker
systemctl start nvidia-docker
```
### Issue 6Yarn failed to start containers
if the number of GPUs required by applications is larger than the number of GPUs in the cluster, there would be some containers can't be created.

View File

@ -0,0 +1,757 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
# Submarine 安装说明
## Prerequisites
### 操作系统
我们使用的操作系统版本是 centos-release-7-3.1611.el7.centos.x86_64, 内核版本是 3.10.0-514.el7.x86_64 ,应该是最低版本了。
| Enviroment | Verion |
| ------ | ------ |
| Operating System | centos-release-7-3.1611.el7.centos.x86_64 |
| Kernal | 3.10.0-514.el7.x86_64 |
### User & Group
如果操作系统中没有这些用户组和用户,必须添加。一部分用户是 hadoop 运行需要,一部分用户是 docker 运行需要。
```
adduser hdfs
adduser mapred
adduser yarn
addgroup hadoop
usermod -aG hdfs,hadoop hdfs
usermod -aG mapred,hadoop mapred
usermod -aG yarn,hadoop yarn
usermod -aG hdfs,hadoop hadoop
groupadd docker
usermod -aG docker yarn
usermod -aG docker hadoop
```
### GCC 版本
```bash
gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
# 如果没有安装请执行以下命令进行安装
yum install gcc make g++
```
### Kernel header & devel
```bash
# 方法一:
yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
# 方法二:
wget http://vault.centos.org/7.3.1611/os/x86_64/Packages/kernel-headers-3.10.0-514.el7.x86_64.rpm
rpm -ivh kernel-headers-3.10.0-514.el7.x86_64.rpm
```
### 检查 GPU 版本
```
lspci | grep -i nvidia
# 如果什么都没输出,就说明显卡不对,以下是我的输出:
# 04:00.0 3D controller: NVIDIA Corporation Device 1b38 (rev a1)
# 82:00.0 3D controller: NVIDIA Corporation Device 1b38 (rev a1)
```
### 安装 nvidia 驱动
安装nvidia driver/cuda要确保已安装的nvidia driver/cuda已被清理
```
# 卸载cuda
sudo /usr/local/cuda-10.0/bin/uninstall_cuda_10.0.pl
# 卸载nvidia-driver
sudo /usr/bin/nvidia-uninstall
```
安装nvidia-detect用于检查显卡版本
```
yum install nvidia-detect
# 运行命令 nvidia-detect -v 返回结果:
nvidia-detect -v
Probing for supported NVIDIA devices...
[10de:13bb] NVIDIA Corporation GM107GL [Quadro K620]
This device requires the current 390.87 NVIDIA driver kmod-nvidia
[8086:1912] Intel Corporation HD Graphics 530
An Intel display controller was also detected
```
注意这里的信息 [Quadro K620] 和390.87。
下载 [NVIDIA-Linux-x86_64-390.87.run](https://www.nvidia.com/object/linux-amd64-display-archive.html)
安装前的一系列准备工作
```
# 若系统很久没更新,这句可能耗时较长
yum -y update
yum -y install kernel-devel
yum -y install epel-release
yum -y install dkms
# 禁用nouveau
vim /etc/default/grub #在“GRUB_CMDLINE_LINUX”中添加内容 rd.driver.blacklist=nouveau nouveau.modeset=0
grub2-mkconfig -o /boot/grub2/grub.cfg # 生成配置
vim /etc/modprobe.d/blacklist.conf # 打开新建文件添加内容blacklist nouveau
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
dracut /boot/initramfs-$(uname -r).img $(uname -r) # 更新配置,并重启
reboot
```
开机后确认是否禁用
```
lsmod | grep nouveau # 应该返回空
# 开始安装
sh NVIDIA-Linux-x86_64-390.87.run
```
安装过程中,会遇到一些选项:
```
Install NVIDIA's 32-bit compatibility libraries (Yes)
centos Install NVIDIA's 32-bit compatibility libraries (Yes)
Would you like to run the nvidia-xconfig utility to automatically update your X configuration file... (NO)
```
最后查看 nvidia gpu 状态
```
nvidia-smi
```
reference
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
### 安装 Docker
```
yum -y update
yum -y install yum-utils
yum-config-manager --add-repo https://yum.dockerproject.org/repo/main/centos/7
yum -y update
# 显示 available 的安装包
yum search --showduplicates docker-engine
# 安装 1.12.5 版本 docker
yum -y --nogpgcheck install docker-engine-1.12.5*
systemctl start docker
chown hadoop:netease /var/run/docker.sock
chown hadoop:netease /usr/bin/docker
```
Referencehttps://docs.docker.com/cs-engine/1.12/
### 配置 Docker
`/etc/docker/` 目录下,创建`daemon.json`文件, 添加以下配置变量如image_registry_ip, etcd_host_ip, localhost_ip, yarn_dns_registry_host_ip, dns_host_ip需要根据具体环境进行修改
```
{
"insecure-registries": ["${image_registry_ip}:5000"],
"cluster-store":"etcd://${etcd_host_ip1}:2379,${etcd_host_ip2}:2379,${etcd_host_ip3}:2379",
"cluster-advertise":"{localhost_ip}:2375",
"dns": ["${yarn_dns_registry_host_ip}", "${dns_host_ip1}"],
"hosts": ["tcp://{localhost_ip}:2375", "unix:///var/run/docker.sock"]
}
```
重启 docker daemon
```
sudo systemctl restart docker
```
### Docker EE version
```bash
$ docker version
Client:
Version: 1.12.5
API version: 1.24
Go version: go1.6.4
Git commit: 7392c3b
Built: Fri Dec 16 02:23:59 2016
OS/Arch: linux/amd64
Server:
Version: 1.12.5
API version: 1.24
Go version: go1.6.4
Git commit: 7392c3b
Built: Fri Dec 16 02:23:59 2016
OS/Arch: linux/amd64
```
### 安装 nvidia-docker
Hadoop-3.2 的 submarine 使用的是 1.0 版本的 nvidia-docker
```
wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker-1.0.1-1.x86_64.rpm
sudo rpm -i /tmp/nvidia-docker*.rpm
# 启动 nvidia-docker
sudo systemctl start nvidia-docker
# 查看 nvidia-docker 状态:
systemctl status nvidia-docker
# 查看 nvidia-docker 日志:
journalctl -u nvidia-docker
# 查看 nvidia-docker-plugin 是否正常
curl http://localhost:3476/v1.0/docker/cli
```
`/var/lib/nvidia-docker/volumes/nvidia_driver/` 路径下,根据 `nvidia-driver` 的版本创建文件夹:
```
mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87
# 其中390.87是nvidia driver的版本号
mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/bin
mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
cp /usr/bin/nvidia* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/bin
cp /usr/lib64/libcuda* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
cp /usr/lib64/libnvidia* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
# Test nvidia-smi
nvidia-docker run --rm nvidia/cuda:9.0-devel nvidia-smi
```
测试 docker, nvidia-docker, nvidia-driver 安装
```
# 测试一
nvidia-docker run -rm nvidia/cuda nvidia-smi
```
```
# 测试二
nvidia-docker run -it tensorflow/tensorflow:1.9.0-gpu bash
# 在docker中执行
python
import tensorflow as tf
tf.test.is_gpu_available()
```
卸载 nvidia-docker 1.0 的方法:
https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)
reference:
https://github.com/NVIDIA/nvidia-docker/tree/1.0
### Tensorflow Image
CUDNN 和 CUDA 其实不需要在物理机上安装,因为 Sumbmarine 中提供了已经包含了CUDNN 和 CUDA 的镜像文件基础的Dockfile可参见WriteDockerfile.md
上述images无法支持kerberos环境如果需要kerberos可以使用如下Dockfile
```shell
FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
# Pick up some TF dependencies
RUN apt-get update && apt-get install -y --allow-downgrades --no-install-recommends \
build-essential \
cuda-command-line-tools-9-0 \
cuda-cublas-9-0 \
cuda-cufft-9-0 \
cuda-curand-9-0 \
cuda-cusolver-9-0 \
cuda-cusparse-9-0 \
curl \
libcudnn7=7.0.5.15-1+cuda9.0 \
libfreetype6-dev \
libpng12-dev \
libzmq3-dev \
pkg-config \
python \
python-dev \
rsync \
software-properties-common \
unzip \
&& \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN export DEBIAN_FRONTEND=noninteractive && apt-get update && apt-get install -yq krb5-user libpam-krb5 && apt-get clean
RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py
RUN pip --no-cache-dir install \
Pillow \
h5py \
ipykernel \
jupyter \
matplotlib \
numpy \
pandas \
scipy \
sklearn \
&& \
python -m ipykernel.kernelspec
# Install TensorFlow GPU version.
RUN pip --no-cache-dir install \
http://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.8.0-cp27-none-linux_x86_64.whl
RUN apt-get update && apt-get install git -y
RUN apt-get update && apt-get install -y openjdk-8-jdk wget
# 下载 hadoop-3.1.1.tar.gz
RUN wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
RUN tar zxf hadoop-3.1.1.tar.gz
RUN mv hadoop-3.1.1 hadoop-3.1.0
# 下载支持kerberos的jdk安装包
RUN wget -qO jdk8.tar.gz 'http://${kerberos_jdk_url}/jdk-8u152-linux-x64.tar.gz'
RUN tar xzf jdk8.tar.gz -C /opt
RUN mv /opt/jdk* /opt/java
RUN rm jdk8.tar.gz
RUN update-alternatives --install /usr/bin/java java /opt/java/bin/java 100
RUN update-alternatives --install /usr/bin/javac javac /opt/java/bin/javac 100
ENV JAVA_HOME /opt/java
ENV PATH $PATH:$JAVA_HOME/bin
```
### 测试 TF 环境
创建好 docker 镜像后,需要先手动检查 TensorFlow 是否可以正常使用,避免通过 YARN 调度后出现问题,可以执行以下命令
```shell
$ docker run -it ${docker_image_name} /bin/bash
# >>> 进入容器
$ python
$ python >> import tensorflow as tf
$ python >> tf.__version__
```
如果出现问题,可以按照以下路径进行排查
1. 环境变量是否设置正确
```
echo $LD_LIBRARY_PATH
/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
```
2. libcuda.so.1,libcuda.so是否在LD_LIBRARY_PATH指定的路径中
```
ls -l /usr/local/nvidia/lib64 | grep libcuda.so
```
### 安装 Etcd
运行 Submarine/install.sh 脚本,就可以在指定服务器中安装 Etcd 组件和服务自启动脚本。
```shell
$ ./Submarine/install.sh
# 通过如下命令查看 Etcd 服务状态
systemctl status Etcd.service
```
检查 Etcd 服务状态
```shell
$ etcdctl cluster-health
member 3adf2673436aa824 is healthy: got healthy result from http://${etcd_host_ip1}:2379
member 85ffe9aafb7745cc is healthy: got healthy result from http://${etcd_host_ip2}:2379
member b3d05464c356441a is healthy: got healthy result from http://${etcd_host_ip3}:2379
cluster is healthy
$ etcdctl member list
3adf2673436aa824: name=etcdnode3 peerURLs=http://${etcd_host_ip1}:2380 clientURLs=http://${etcd_host_ip1}:2379 isLeader=false
85ffe9aafb7745cc: name=etcdnode2 peerURLs=http://${etcd_host_ip2}:2380 clientURLs=http://${etcd_host_ip2}:2379 isLeader=false
b3d05464c356441a: name=etcdnode1 peerURLs=http://${etcd_host_ip3}:2380 clientURLs=http://${etcd_host_ip3}:2379 isLeader=true
```
其中,${etcd_host_ip*} 是etcd服务器的ip
### 安装 Calico
运行 Submarine/install.sh 脚本,就可以在指定服务器中安装 Calico 组件和服务自启动脚本。
```
systemctl start calico-node.service
systemctl status calico-node.service
```
#### 检查 Calico 网络
```shell
# 执行如下命令,注意:不会显示本服务器的状态,只显示其他的服务器状态
$ calicoctl node status
Calico process is running.
IPv4 BGP status
+---------------+-------------------+-------+------------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+---------------+-------------------+-------+------------+-------------+
| ${host_ip1} | node-to-node mesh | up | 2018-09-21 | Established |
| ${host_ip2} | node-to-node mesh | up | 2018-09-21 | Established |
| ${host_ip3} | node-to-node mesh | up | 2018-09-21 | Established |
+---------------+-------------------+-------+------------+-------------+
IPv6 BGP status
No IPv6 peers found.
```
创建docker container验证calico网络
```
docker network create --driver calico --ipam-driver calico-ipam calico-network
docker run --net calico-network --name workload-A -tid busybox
docker run --net calico-network --name workload-B -tid busybox
docker exec workload-A ping workload-B
```
## 安装 Hadoop
### 编译 Hadoop
```
mvn package -Pdist -DskipTests -Dtar
```
### 启动 YARN服务
```
YARN_LOGFILE=resourcemanager.log ./sbin/yarn-daemon.sh start resourcemanager
YARN_LOGFILE=nodemanager.log ./sbin/yarn-daemon.sh start nodemanager
YARN_LOGFILE=timeline.log ./sbin/yarn-daemon.sh start timelineserver
YARN_LOGFILE=mr-historyserver.log ./sbin/mr-jobhistory-daemon.sh start historyserver
```
### 启动 registery dns 服务
```
sudo YARN_LOGFILE=registrydns.log ./yarn-daemon.sh start registrydns
```
### 测试 wordcount
通过测试最简单的 wordcount ,检查 YARN 是否正确安装
```
./bin/hadoop jar /home/hadoop/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0-SNAPSHOT.jar wordcount /tmp/wordcount.txt /tmp/wordcount-output4
```
## 使用CUP的Tensorflow任务
### 单机模式
#### 清理重名程序
```bash
# 每次提交前需要执行:
./bin/yarn app -destroy standalone-tf
# 并删除hdfs路径
./bin/hdfs dfs -rmr hdfs://${dfs_name_service}/tmp/cifar-10-jobdir
# 确保之前的任务已经结束
```
其中,变量${dfs_name_service}请根据环境用你的hdfs name service名称替换
#### 执行单机模式的tensorflow任务
```bash
./bin/yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
--env DOCKER_JAVA_HOME=/opt/java \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name standalone-tf \
--docker_image dockerfile-cpu-tf1.8.0-with-models \
--input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
--checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-checkpoint \
--worker_resources memory=4G,vcores=2 --verbose \
--worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --num-gpus=0"
```
### 分布式模式
#### 清理重名程序
```bash
# 每次提交前需要执行:
./bin/yarn app -destroy distributed-tf
# 并删除hdfs路径
./bin/hdfs dfs -rmr hdfs://${dfs_name_service}/tmp/cifar-10-jobdir
# 确保之前的任务已经结束
```
#### 提交分布式模式 tensorflow 任务
```bash
./bin/yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
--env DOCKER_JAVA_HOME=/opt/java \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf \
--env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
--docker_image dockerfile-cpu-tf1.8.0-with-models \
--input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
--checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-distributed-checkpoint \
--worker_resources memory=4G,vcores=2 --verbose \
--num_ps 1 \
--ps_resources memory=4G,vcores=2 \
--ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --num-gpus=0" \
--num_workers 4 \
--worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${${dfs_name_service}}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=0"
```
## 使用GPU的Tensorflow任务
### Resourcemanager, Nodemanager 中添加GPU支持
在 yarn 配置文件夹(conf或etc/hadoop)中创建 resource-types.xml添加
```
<configuration>
<property>
<name>yarn.resource-types</name>
<value>yarn.io/gpu</value>
</property>
</configuration>
```
### Resourcemanager 的 GPU 配置
resourcemanager 使用的 scheduler 必须是 capacity scheduler在 capacity-scheduler.xml 中修改属性:
```
<configuration>
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
</configuration>
```
### Nodemanager 的 GPU 配置
在 nodemanager 的 yarn-site.xml 中添加配置:
```
<configuration>
<property>
<name>yarn.nodemanager.resource-plugins</name>
<value>yarn.io/gpu</value>
</property>
</configuration>
```
在 container-executor.cfg 中添加配置:
```
[docker]
...
# 在[docker]已有配置中,添加以下内容:
# /usr/bin/nvidia-docker是nvidia-docker路径
# nvidia_driver_375.26的版本号375.26可以使用nvidia-smi查看
docker.allowed.volume-drivers=/usr/bin/nvidia-docker
docker.allowed.devices=/dev/nvidiactl,/dev/nvidia-uvm,/dev/nvidia-uvm-tools,/dev/nvidia1,/dev/nvidia0
docker.allowed.ro-mounts=nvidia_driver_375.26
[gpu]
module.enabled=true
[cgroups]
# /sys/fs/cgroup是cgroup的mount路径
# /hadoop-yarn是yarn在cgroup路径下默认创建的path
root=/sys/fs/cgroup
yarn-hierarchy=/hadoop-yarn
```
### 提交验证
Distributed-shell + GPU + cgroup
```bash
./yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
--env DOCKER_JAVA_HOME=/opt/java \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
--env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
--docker_image gpu-cuda9.0-tf1.8.0-with-models \
--input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
--checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-distributed-checkpoint \
--num_ps 0 \
--ps_resources memory=4G,vcores=2,gpu=0 \
--ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --num-gpus=0" \
--worker_resources memory=4G,vcores=2,gpu=1 --verbose \
--num_workers 1 \
--worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"
```
## 问题
### 问题一: 操作系统重启导致 nodemanager 启动失败
```
2018-09-20 18:54:39,785 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to bootstrap configured resource subsystems!
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Unexpected: Cannot create yarn cgroup Subsystem:cpu Mount points:/proc/mounts User:yarn Path:/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializePreMountedCGroupController(CGroupsHandlerImpl.java:425)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializeCGroupController(CGroupsHandlerImpl.java:377)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:98)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:87)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:320)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:389)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997)
2018-09-20 18:54:39,789 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED
```
解决方法:使用 `root` 账号给 `yarn` 用户修改 `/sys/fs/cgroup/cpu,cpuacct` 的权限
```
chown :yarn -R /sys/fs/cgroup/cpu,cpuacct
chmod g+rwx -R /sys/fs/cgroup/cpu,cpuacct
```
在支持gpu时还需cgroup devices路径权限
```
chown :yarn -R /sys/fs/cgroup/devices
chmod g+rwx -R /sys/fs/cgroup/devices
```
### 问题二container-executor 权限问题
```
2018-09-21 09:36:26,102 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: IOException executing command:
java.io.IOException: Cannot run program "/etc/yarn/sbin/Linux-amd64-64/container-executor": error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:938)
at org.apache.hadoop.util.Shell.run(Shell.java:901)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
```
`/etc/yarn/sbin/Linux-amd64-64/container-executor` 该文件的权限应为6050
### 问题三:查看系统服务启动日志
```
journalctl -u docker
```
### 问题四Docker 无法删除容器的问题 `device or resource busy`
```bash
$ docker rm 0bfafa146431
Error response from daemon: Unable to remove filesystem for 0bfafa146431771f6024dcb9775ef47f170edb2f1852f71916ba44209ca6120a: remove /app/docker/containers/0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a/shm: device or resource busy
```
编写 `find-busy-mnt.sh` 脚本,检查 `device or resource busy` 状态的容器挂载文件
```bash
#!/bin/bash
# A simple script to get information about mount points and pids and their
# mount namespaces.
if [ $# -ne 1 ];then
echo "Usage: $0 <devicemapper-device-id>"
exit 1
fi
ID=$1
MOUNTS=`find /proc/*/mounts | xargs grep $ID 2>/dev/null`
[ -z "$MOUNTS" ] && echo "No pids found" && exit 0
printf "PID\tNAME\t\tMNTNS\n"
echo "$MOUNTS" | while read LINE; do
PID=`echo $LINE | cut -d ":" -f1 | cut -d "/" -f3`
# Ignore self and thread-self
if [ "$PID" == "self" ] || [ "$PID" == "thread-self" ]; then
continue
fi
NAME=`ps -q $PID -o comm=`
MNTNS=`readlink /proc/$PID/ns/mnt`
printf "%s\t%s\t\t%s\n" "$PID" "$NAME" "$MNTNS"
done
```
查找占用目录的进程
```bash
$ chmod +x find-busy-mnt.sh
./find-busy-mnt.sh 0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a
# PID NAME MNTNS
# 5007 ntpd mnt:[4026533598]
$ kill -9 5007
```
### 问题五命令sudo nvidia-docker run 报错
```
docker: Error response from daemon: create nvidia_driver_361.42: VolumeDriver.Create: internal error, check logs for details.
See 'docker run --help'.
```
解决方法:
```
#查看nvidia-docker状态是不是启动有问题可以使用
$ systemctl status nvidia-docker
$ journalctl -n -u nvidia-docker
#重启下nvidia-docker
systemctl stop nvidia-docker
systemctl start nvidia-docker
```
### 问题六YARN 启动容器失败
如果你创建的容器数PS+Work>GPU显卡总数可能会出现容器创建失败那是因为在一台服务器上同时创建了超过本机显卡总数的容器。