SUBMARINE-82. Fix english grammar mistakes in documentation. Contributed by Szilard Nemeth.
(cherry picked from commit 799115967d)
parent c7924cccb1
commit c177cc9750
|
@ -14,7 +14,7 @@
|
|||
|
||||
# Examples
|
||||
|
||||
Here're some examples about Submarine usage.
|
||||
Here are some examples about how to use Submarine:
|
||||
|
||||
[Running Distributed CIFAR 10 Tensorflow Job](RunningDistributedCifar10TFJobs.html)
|
||||
|
||||
|
|
|
@ -14,23 +14,23 @@
|
|||
|
||||
# How to Install Dependencies
|
||||
|
||||
Submarine project uses YARN Service, Docker container, and GPU (when GPU hardware available and properly configured).
|
||||
Submarine project uses YARN Service, Docker container and GPU.
|
||||
GPU could only be used if a GPU hardware is available and properly configured.
|
||||
|
||||
That means as an admin, you have to properly setup YARN Service related dependencies, including:
|
||||
As an administrator, you have to properly setup YARN Service related dependencies, including:
|
||||
- YARN Registry DNS
|
||||
- Docker related dependencies, including:
|
||||
- Docker binary with expected versions
|
||||
- Docker network that allows Docker containers to talk to each other across different nodes
|
||||
|
||||
Docker related dependencies, including:
|
||||
- Docker binary with expected versions.
|
||||
- Docker network which allows Docker container can talk to each other across different nodes.
|
||||
If you would like to use GPU, you need to set up:
|
||||
- GPU Driver
|
||||
- Nvidia-docker
|
||||
|
||||
And when GPU wanna to be used:
|
||||
- GPU Driver.
|
||||
- Nvidia-docker.
|
||||
|
||||
For your convenience, we provided installation documents to help you to setup your environment. You can always choose to have them installed in your own way.
|
||||
For your convenience, we provided some installation documents to help you setup your environment. You can always choose to have them installed in your own way.
|
||||
|
||||
Use Submarine installer to install dependencies: [EN](https://github.com/hadoopsubmarine/hadoop-submarine-ecosystem/tree/master/submarine-installer) [CN](https://github.com/hadoopsubmarine/hadoop-submarine-ecosystem/blob/master/submarine-installer/README-CN.md)
|
||||
|
||||
Alternatively, you can follow manual install dependencies: [EN](InstallationGuide.html) [CN](InstallationGuideChineseVersion.html)
|
||||
Alternatively, you can follow this guide to manually install dependencies: [EN](InstallationGuide.html) [CN](InstallationGuideChineseVersion.html)
|
||||
|
||||
Once you have installed dependencies, please follow following guide to [TestAndTroubleshooting](TestAndTroubleshooting.html).
|
||||
Once you have installed all the dependencies, please follow this guide: [TestAndTroubleshooting](TestAndTroubleshooting.html).
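As a rough preflight sketch (not part of the Submarine installer), the loop below reports whether the command-line tools behind the dependencies listed above are on `PATH`. The tool names `docker`, `nvidia-smi` and `nvidia-docker` are assumptions about how the Docker binary, the GPU driver and nvidia-docker are exposed on your nodes, and `check_tool` is a hypothetical helper.

```shell
# Hypothetical helper: print one status line for the named command.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: MISSING"
  fi
}

# Assumed tool names; adjust them to match your installation.
for tool in docker nvidia-smi nvidia-docker; do
  check_tool "$tool"
done
```

A `MISSING` line only means the command is not on `PATH` for the current user; it does not tell you whether the version installed elsewhere is a supported one.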
|
|
@ -21,20 +21,20 @@ Goals of Submarine:
|
|||
|
||||
- Can launch services to serve Tensorflow/MXNet models.
|
||||
|
||||
- Support run distributed Tensorflow jobs with simple configs.
|
||||
- Supports running distributed Tensorflow jobs with simple configs.
|
||||
|
||||
- Support run standalone PyTorch jobs with simple configs.
|
||||
- Supports running standalone PyTorch jobs with simple configs.
|
||||
|
||||
- Support run user-specified Docker images.
|
||||
- Supports running user-specified Docker images.
|
||||
|
||||
- Support specify GPU and other resources.
|
||||
- Supports specifying GPU and other resources.
|
||||
|
||||
- Support launch tensorboard for training jobs if user specified.
|
||||
- Supports launching Tensorboard for training jobs (optional, if specified).
|
||||
|
||||
- Support customized DNS name for roles (like tensorboard.$user.$domain:6006)
|
||||
- Supports customized DNS name for roles (like tensorboard.$user.$domain:6006)
|
||||
|
||||
|
||||
Click below contents if you want to understand more.
|
||||
If you want to deep-dive, please check these resources:
|
||||
|
||||
- [QuickStart Guide](QuickStart.html)
|
||||
|
||||
|
|
|
@ -16,20 +16,25 @@
|
|||
|
||||
## Prerequisites
|
||||
|
||||
(Please note that all following prerequisites are just an example for you to install. You can always choose to install your own version of kernel, different users, different drivers, etc.).
|
||||
Please note that the following prerequisites are just an example for you to install Submarine.
|
||||
|
||||
You can always choose to install your own version of kernel, different users, different drivers, etc.
|
||||
|
||||
### Operating System
|
||||
|
||||
The operating system and kernel versions we have tested are as shown in the following table, which is the recommneded minimum required versions.
|
||||
The operating system and kernel versions we have tested against are shown in the following table.
|
||||
The versions in the table are the recommended minimum required versions.
|
||||
|
||||
| Enviroment | Verion |
|
||||
| Environment | Version |
|
||||
| ------ | ------ |
|
||||
| Operating System | centos-release-7-5.1804.el7.centos.x86_64 |
|
||||
| Kernal | 3.10.0-862.el7.x86_64 |
|
||||
| Kernel | 3.10.0-862.el7.x86_64 |
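To compare a node against the kernel row of this table, a small sketch like the following may help. It assumes GNU `sort -V` is available (as on CentOS), compares only the base version before the first hyphen (so `3.10.0`, not the `-862` build suffix), and uses a hypothetical helper named `version_ge`.

```shell
# Hypothetical helper: succeed if version $1 is at least version $2.
# Relies on GNU `sort -V` (version sort), which is not POSIX.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

kernel="$(uname -r)"
kernel_base="${kernel%%-*}"   # strip everything from the first hyphen on

if version_ge "$kernel_base" "3.10.0"; then
  echo "kernel $kernel meets the tested minimum base version"
else
  echo "kernel $kernel is older than the tested minimum base version"
fi
```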
|
||||
|
||||
### User & Group
|
||||
|
||||
As there are some specific users and groups recommended to be created to install hadoop/docker. Please create them if they are missing.
|
||||
There are specific users and groups recommended to be created to install Hadoop with Docker.
|
||||
|
||||
Please create these users if they do not exist.
|
||||
|
||||
```
|
||||
adduser hdfs
|
||||
|
@ -80,7 +85,9 @@ lspci | grep -i nvidia
|
|||
|
||||
### Nvidia Driver Installation (Only for Nvidia GPU equipped nodes)
|
||||
|
||||
To make a clean installation, if you have requirements to upgrade GPU drivers. If nvidia driver/cuda has been installed before, They should be uninstalled firstly.
|
||||
To make a clean installation, if you have requirements to upgrade GPU drivers.
|
||||
|
||||
If nvidia driver / CUDA has been installed before, they should be uninstalled as a first step.
|
||||
|
||||
```
|
||||
# uninstall cuda:
|
||||
|
@ -90,7 +97,7 @@ sudo /usr/local/cuda-10.0/bin/uninstall_cuda_10.0.pl
|
|||
sudo /usr/bin/nvidia-uninstall
|
||||
```
|
||||
|
||||
To check GPU version, install nvidia-detect
|
||||
To check GPU version, install nvidia-detect:
|
||||
|
||||
```
|
||||
yum install nvidia-detect
|
||||
|
@ -107,7 +114,9 @@ Pay attention to `This device requires the current xyz.nm NVIDIA driver kmod-nvi
|
|||
Download the installer like [NVIDIA-Linux-x86_64-390.87.run](https://www.nvidia.com/object/linux-amd64-display-archive.html).
|
||||
|
||||
|
||||
Some preparatory work for nvidia driver installation. (This is follow normal Nvidia GPU driver installation, just put here for your convenience)
|
||||
Some preparatory work for Nvidia driver installation.
|
||||
|
||||
The steps below are for Nvidia GPU driver installation, just pasted here for your convenience.
|
||||
|
||||
```
|
||||
# It may take a while to update
|
||||
|
@ -152,7 +161,7 @@ Would you like to run the nvidia-xconfig utility to automatically update your X
|
|||
```
|
||||
|
||||
|
||||
Check nvidia driver installation
|
||||
Check Nvidia driver installation
|
||||
|
||||
```
|
||||
nvidia-smi
|
||||
|
@ -165,7 +174,7 @@ https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
|
|||
|
||||
### Docker Installation
|
||||
|
||||
The following steps show how to install docker 18.06.1.ce. You can choose other approaches to install Docker.
|
||||
The following steps show you how to install docker 18.06.1.ce. You can choose other approaches to install Docker.
|
||||
|
||||
```
|
||||
# Remove old version docker
|
||||
|
@ -205,7 +214,9 @@ Reference:https://docs.docker.com/install/linux/docker-ce/centos/
|
|||
|
||||
### Docker Configuration
|
||||
|
||||
Add a file, named daemon.json, under the path of /etc/docker/. Please replace the variables of image_registry_ip, etcd_host_ip, localhost_ip, yarn_dns_registry_host_ip, dns_host_ip with specific ips according to your environments.
|
||||
Add a file, named daemon.json, under the path of /etc/docker/.
|
||||
|
||||
Please replace the variables of image_registry_ip, etcd_host_ip, localhost_ip, yarn_dns_registry_host_ip, dns_host_ip with specific IPs according to your environment.
|
||||
|
||||
```
|
||||
{
|
||||
|
@ -294,7 +305,7 @@ import tensorflow as tf
|
|||
tf.test.is_gpu_available()
|
||||
```
|
||||
|
||||
The way to uninstall nvidia-docker V2
|
||||
If you want to uninstall nvidia-docker V2:
|
||||
```
|
||||
sudo yum remove -y nvidia-docker2-2.0.3-1.docker18.06.1.ce
|
||||
```
|
||||
|
@ -304,12 +315,14 @@ https://github.com/NVIDIA/nvidia-docker
|
|||
|
||||
### Tensorflow Image
|
||||
|
||||
There is no need to install CUDNN and CUDA on the servers, because CUDNN and CUDA can be added in the docker images. We can get basic docker images by referring to [Write Dockerfile](WriteDockerfileTF.html).
|
||||
There is no need to install CUDNN and CUDA on the servers, because CUDNN and CUDA can be added in the docker images.
|
||||
|
||||
We can get or build basic docker images by referring to [Write Dockerfile](WriteDockerfileTF.html).
|
||||
|
||||
### Test tensorflow in a docker container
|
||||
|
||||
After docker image is built, we can check
|
||||
Tensorflow environments before submitting a yarn job.
|
||||
Tensorflow environments before submitting a Submarine job.
|
||||
|
||||
```shell
|
||||
$ docker run -it ${docker_image_name} /bin/bash
|
||||
|
@ -336,8 +349,8 @@ If there are some errors, we could check the following configuration.
|
|||
|
||||
### Etcd Installation
|
||||
|
||||
etcd is a distributed reliable key-value store for the most critical data of a distributed system, Registration and discovery of services used in containers.
|
||||
You can also choose alternatives like zookeeper, Consul.
|
||||
etcd is a distributed, reliable key-value store for the most critical data of a distributed system, Registration and discovery of services used in containers.
|
||||
You can also choose alternatives like ZooKeeper, Consul or others.
|
||||
|
||||
To install Etcd on specified servers, we can run Submarine-installer/install.sh
|
||||
|
||||
|
@ -366,8 +379,10 @@ b3d05464c356441a: name=etcdnode1 peerURLs=http://${etcd_host_ip3}:2380 clientURL
|
|||
|
||||
### Calico Installation
|
||||
|
||||
Calico creates and manages a flat three-tier network, and each container is assigned a routable ip. We just add the steps here for your convenience.
|
||||
You can also choose alternatives like Flannel, OVS.
|
||||
Calico creates and manages a flat three-tier network, and each container is assigned a routable IP address.
|
||||
|
||||
We are listing the steps here for your convenience.
|
||||
You can also choose alternatives like Flannel, OVS or others.
|
||||
|
||||
To install Calico on specified servers, we can run Submarine-installer/install.sh
|
||||
|
||||
|
@ -379,7 +394,7 @@ systemctl status calico-node.service
|
|||
#### Check Calico Network
|
||||
|
||||
```shell
|
||||
# Run the following command to show the all host status in the cluster except localhost.
|
||||
# Run the following command to show all host status in the cluster except localhost.
|
||||
$ calicoctl node status
|
||||
Calico process is running.
|
||||
|
||||
|
@ -412,7 +427,7 @@ docker exec workload-A ping workload-B
|
|||
You can either get Hadoop release binary or compile from source code. Please follow the https://hadoop.apache.org/ guides.
|
||||
|
||||
|
||||
### Start yarn service
|
||||
### Start YARN service
|
||||
|
||||
```
|
||||
YARN_LOGFILE=resourcemanager.log ./sbin/yarn-daemon.sh start resourcemanager
|
||||
|
@ -421,7 +436,7 @@ YARN_LOGFILE=timeline.log ./sbin/yarn-daemon.sh start timelineserver
|
|||
YARN_LOGFILE=mr-historyserver.log ./sbin/mr-jobhistory-daemon.sh start historyserver
|
||||
```
|
||||
|
||||
### Start yarn registery dns service
|
||||
### Start YARN registry DNS service
|
||||
|
||||
```
|
||||
sudo YARN_LOGFILE=registrydns.log ./yarn-daemon.sh start registrydns
|
||||
|
@ -441,13 +456,13 @@ sudo YARN_LOGFILE=registrydns.log ./yarn-daemon.sh start registrydns
|
|||
|
||||
#### Clean up apps with the same name
|
||||
|
||||
Suppose we want to submit a tensorflow job named standalone-tf, destroy any application with the same name and clean up historical job directories.
|
||||
Suppose we want to submit a TensorFlow job named standalone-tf, destroy any application with the same name and clean up historical job directories.
|
||||
|
||||
```bash
|
||||
./bin/yarn app -destroy standalone-tf
|
||||
./bin/hdfs dfs -rmr hdfs://${dfs_name_service}/tmp/cifar-10-jobdir
|
||||
```
|
||||
where ${dfs_name_service} is the hdfs name service you use
|
||||
where ${dfs_name_service} is the HDFS name service you use
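For illustration, the sketch below only prints the two cleanup commands with the placeholders filled in, so you can review them before running anything. `cleanup_cmds` is a hypothetical helper and `mycluster` is a made-up name service value.

```shell
# Hypothetical helper: print (do not run) the cleanup commands
# for a job name ($1) and an HDFS name service ($2).
cleanup_cmds() {
  job_name="$1"
  name_service="$2"
  echo "./bin/yarn app -destroy ${job_name}"
  echo "./bin/hdfs dfs -rmr hdfs://${name_service}/tmp/cifar-10-jobdir"
}

# 'mycluster' is a placeholder; use your own name service.
cleanup_cmds standalone-tf mycluster
```

Newer Hadoop releases report `-rmr` as deprecated in favor of `-rm -r`, so you may want to substitute that form if your version warns about it.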
|
||||
|
||||
#### Run a standalone tensorflow job
|
||||
|
||||
|
@ -471,7 +486,7 @@ where ${dfs_name_service} is the hdfs name service you use
|
|||
./bin/hdfs dfs -rmr hdfs://${dfs_name_service}/tmp/cifar-10-jobdir
|
||||
```
|
||||
|
||||
#### Run a distributed tensorflow job
|
||||
#### Run a distributed TensorFlow job
|
||||
|
||||
```bash
|
||||
./bin/yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
|
||||
|
@ -490,11 +505,11 @@ where ${dfs_name_service} is the hdfs name service you use
|
|||
```
|
||||
|
||||
|
||||
## Tensorflow Job with GPU
|
||||
## TensorFlow Job with GPU
|
||||
|
||||
### GPU configurations for both resourcemanager and nodemanager
|
||||
### GPU configurations for both ResourceManager and NodeManager
|
||||
|
||||
Add the yarn resource configuration file, named resource-types.xml
|
||||
Add the YARN resource configuration file, named resource-types.xml
|
||||
|
||||
```
|
||||
<configuration>
|
||||
|
@ -505,9 +520,9 @@ Add the yarn resource configuration file, named resource-types.xml
|
|||
</configuration>
|
||||
```
|
||||
|
||||
#### GPU configurations for resourcemanager
|
||||
#### GPU configurations for ResourceManager
|
||||
|
||||
The scheduler used by resourcemanager must be capacity scheduler, and yarn.scheduler.capacity.resource-calculator in capacity-scheduler.xml should be DominantResourceCalculator
|
||||
The scheduler used by ResourceManager must be the capacity scheduler, and yarn.scheduler.capacity.resource-calculator in capacity-scheduler.xml should be DominantResourceCalculator
|
||||
|
||||
```
|
||||
<configuration>
|
||||
|
@ -518,7 +533,7 @@ The scheduler used by resourcemanager must be capacity scheduler, and yarn.sche
|
|||
</configuration>
|
||||
```
|
||||
|
||||
#### GPU configurations for nodemanager
|
||||
#### GPU configurations for NodeManager
|
||||
|
||||
Add configurations in yarn-site.xml
|
||||
|
||||
|
@ -536,7 +551,7 @@ Add configurations in yarn-site.xml
|
|||
</configuration>
|
||||
```
|
||||
|
||||
Add configurations in container-executor.cfg
|
||||
Add configurations to container-executor.cfg
|
||||
|
||||
```
|
||||
[docker]
|
||||
|
@ -560,7 +575,7 @@ Add configurations in container-executor.cfg
|
|||
yarn-hierarchy=/hadoop-yarn
|
||||
```
|
||||
|
||||
### Run a distributed tensorflow gpu job
|
||||
### Run a distributed TensorFlow GPU job
|
||||
|
||||
```bash
|
||||
./yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
|
||||
|
|
|
@ -22,9 +22,9 @@ Must:
|
|||
|
||||
Optional:
|
||||
|
||||
- Enable YARN DNS. (When yarn service runtime is required.)
|
||||
- Enable GPU on YARN support. (When GPU-based training is required.)
|
||||
- Docker images for Submarine jobs. (When docker container is required.)
|
||||
- Enable YARN DNS. (Only when YARN Service runtime is required)
|
||||
- Enable GPU on YARN support. (When GPU-based training is required)
|
||||
- Docker images for Submarine jobs. (When docker container is required)
|
||||
```
|
||||
# Get prebuilt docker images (No liability)
|
||||
docker pull hadoopsubmarine/tf-1.13.1-gpu:0.0.1
|
||||
|
@ -121,7 +121,7 @@ usage: job run
|
|||
#### Notes:
|
||||
When using `localization` option to make a collection of dependency Python
|
||||
scripts available to entry python script in the container, you may also need to
|
||||
set `PYTHONPATH` environment variable as below to avoid module import error
|
||||
set the `PYTHONPATH` environment variable as below to avoid module import errors
|
||||
reported from `entry_script.py`.
|
||||
|
||||
```
|
||||
|
@ -137,7 +137,7 @@ reported from `entry_script.py`.
|
|||
|
||||
### Submarine Configuration
|
||||
|
||||
For Submarine internal configuration, please create a `submarine.xml` which should be placed under `$HADOOP_CONF_DIR`.
|
||||
For Submarine internal configuration, please create a `submarine.xml` file which should be placed under `$HADOOP_CONF_DIR`.
|
||||
|
||||
|Configuration Name | Description |
|
||||
|:---- |:---- |
|
||||
|
@ -157,7 +157,7 @@ yarn jar path-to/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar job run \
|
|||
--docker_image <your-docker-image> \
|
||||
--input_path hdfs://default/dataset/cifar-10-data \
|
||||
--checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
|
||||
--worker_resources memory=4G,vcores=2,gpu=2 \
|
||||
--worker_resources memory=4G,vcores=2,gpu=2 \
|
||||
--worker_launch_cmd "python ... (Your training application cmd)" \
|
||||
--tensorboard # this will launch a companion tensorboard container for monitoring
|
||||
```
|
||||
|
@ -168,11 +168,13 @@ yarn jar path-to/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar job run \
|
|||
|
||||
2) `DOCKER_HADOOP_HDFS_HOME` points to HADOOP_HDFS_HOME inside Docker image.
|
||||
|
||||
3) `--worker_resources` can include gpu when you need GPU to train your task.
|
||||
3) `--worker_resources` can include GPU when you need GPU to train your task.
|
||||
|
||||
4) When `--tensorboard` is specified, you can go to YARN new UI, go to services -> `<you specified service>` -> Click `...` to access Tensorboard.
|
||||
|
||||
This will launch a Tensorboard to monitor *all your jobs*. By access YARN UI (the new UI). You can go to services page, go to the `tensorboard-service`, click quick links (`Tensorboard`) can lead you to the tensorboard.
|
||||
This will launch Tensorboard to monitor *all your jobs*.
|
||||
By access the YARN UI (new UI), you can go to the Services page, then go to the `tensorboard-service`, click quick links (`Tensorboard`)
|
||||
This will lead you to Tensorboard.
|
||||
|
||||
See below screenshot:
|
||||
|
||||
|
@ -229,7 +231,7 @@ java -cp /path-to/hadoop-conf:/path-to/hadoop-submarine-all-*.jar \
|
|||
|
||||
#### Notes:
|
||||
|
||||
1) Very similar to standalone TF application, but you need to specify #worker/#ps
|
||||
1) Very similar to standalone TF application, but you need to specify number of workers / PS processes.
|
||||
|
||||
2) Different resources can be specified for worker and PS.
|
||||
|
||||
|
@ -283,22 +285,23 @@ java -cp /path-to/hadoop-conf:/path-to/hadoop-submarine-all-*.jar \
|
|||
--num_workers 0 --tensorboard
|
||||
```
|
||||
|
||||
You can view multiple job training history like from the `Tensorboard` link:
|
||||
You can view multiple job training history from the `Tensorboard` link:
|
||||
|
||||
![alt text](./images/multiple-tensorboard-jobs.png "Tensorboard for multiple jobs")
|
||||
|
||||
|
||||
### Get component logs from a training job
|
||||
|
||||
There're two ways to get training job logs, one is from YARN UI (new or old):
|
||||
There are two ways to get the logs of a training job.
|
||||
First, from YARN UI (new or old):
|
||||
|
||||
![alt text](./images/job-logs-ui.png "Job logs UI")
|
||||
|
||||
Or you can use `yarn logs -applicationId <applicationId>` to get logs from CLI
|
||||
Alternatively, you can use `yarn logs -applicationId <applicationId>` to get logs from CLI.
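When the CLI output is long, it can help to narrow it down. The following is a minimal sketch, assuming that lines of interest contain `ERROR` or `Exception`; `filter_errors` is a hypothetical helper, not a YARN feature.

```shell
# Hypothetical helper: keep only lines on stdin that look like errors.
# `|| true` keeps the exit status at 0 when nothing matches.
filter_errors() {
  grep -E 'ERROR|Exception' || true
}

# Usage (assumes `yarn` is on PATH and the application ID is real):
#   yarn logs -applicationId <applicationId> | filter_errors
```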
|
||||
|
||||
## Build from source code
|
||||
|
||||
If you want to build the Submarine project by yourself, you can follow the steps:
|
||||
If you want to build the Submarine project by yourself, you should follow these steps:
|
||||
|
||||
- Run 'mvn install -DskipTests' from Hadoop source top level once.
|
||||
|
||||
|
|
|
@ -18,7 +18,7 @@
|
|||
|
||||
## Prepare data for training
|
||||
|
||||
CIFAR-10 is a common benchmark in machine learning for image recognition. Below example is based on CIFAR-10 dataset.
|
||||
CIFAR-10 is a common benchmark in machine learning for image recognition. The example below is based on CIFAR-10 dataset.
|
||||
|
||||
1) Checkout https://github.com/tensorflow/models/:
|
||||
```
|
||||
|
@ -41,7 +41,7 @@ hadoop fs -put cifar-10-data/ /dataset/cifar-10-data
|
|||
|
||||
**Warning:**
|
||||
|
||||
Please note that YARN service doesn't allow multiple services with the same name, so please run following command
|
||||
Please note that YARN service does not allow multiple services with the same name, so please run following command
|
||||
```
|
||||
yarn application -destroy <service-name>
|
||||
```
|
||||
|
@ -59,8 +59,8 @@ Refer to [Write Dockerfile](WriteDockerfileTF.html) to build a Docker image or u
|
|||
yarn jar path/to/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar \
|
||||
job run --name tf-job-001 --verbose --docker_image tf-1.13.1-gpu:0.0.1 \
|
||||
--input_path hdfs://default/dataset/cifar-10-data \
|
||||
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
|
||||
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-current
|
||||
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
|
||||
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-current \
|
||||
--num_workers 1 --worker_resources memory=8G,vcores=2,gpu=1 \
|
||||
--worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2 --sync" \
|
||||
--tensorboard --tensorboard_docker_image tf-1.13.1-cpu:0.0.1
|
||||
|
@ -69,7 +69,7 @@ yarn jar path/to/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar \
|
|||
Explanations:
|
||||
|
||||
- When access of HDFS is required, the two environments are required to indicate: JAVA_HOME and HDFS_HOME to access libhdfs libraries *inside Docker image*. We will try to eliminate specifying this in the future.
|
||||
- Docker image for worker and tensorboard can be specified separately. For this case, Tensorboard doesn't need GPU, so we will use cpu Docker image for Tensorboard. (Same for parameter-server in the distributed example below).
|
||||
- Docker image for worker and tensorboard can be specified separately. For this case, Tensorboard does not need GPU, so we will use the CPU Docker image for Tensorboard. (Same for parameter-server in the distributed example below).
|
||||
|
||||
### Run distributed training
|
||||
|
||||
|
@ -77,7 +77,7 @@ Explanations:
|
|||
yarn jar path/to/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar \
|
||||
job run --name tf-job-001 --verbose --docker_image tf-1.13.1-gpu:0.0.1 \
|
||||
--input_path hdfs://default/dataset/cifar-10-data \
|
||||
--env(s) (same as standalone)
|
||||
--env(s) (same as standalone) \
|
||||
--num_workers 2 \
|
||||
--worker_resources memory=8G,vcores=2,gpu=1 \
|
||||
--worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2 --sync" \
|
||||
|
@ -90,7 +90,7 @@ yarn jar path/to/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar \
|
|||
Explanations:
|
||||
|
||||
- `>1` num_workers indicates it is a distributed training.
|
||||
- Parameters / resources / Docker image of parameter server can be specified separately. For many cases, parameter server doesn't require GPU.
|
||||
- Parameters / resources / Docker image of parameter server can be specified separately. For many cases, parameter server does not require GPU.
|
||||
|
||||
For the meaning of the individual parameters, see the [QuickStart](QuickStart.html) page!
|
||||
|
||||
|
@ -150,7 +150,7 @@ INFO:tensorflow:Average examples/sec: 54.1082 (55.2134), step = 50
|
|||
INFO:tensorflow:Average examples/sec: 54.3141 (55.3676), step = 60
|
||||
```
|
||||
|
||||
Sample output of ps:
|
||||
Sample output of PS:
|
||||
```
|
||||
...
|
||||
, '_tf_random_seed': None, '_task_type': u'ps', '_environment': u'cloud', '_is_chief': False, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f4be54dff90>, '_tf_config': gpu_options {
|
||||
|
|
|
@ -37,7 +37,7 @@ Distributed-shell + GPU + cgroup
|
|||
|
||||
## Issues:
|
||||
|
||||
### Issue 1: Fail to start nodemanager after system reboot
|
||||
### Issue 1: Fail to start NodeManager after system reboot
|
||||
|
||||
```
|
||||
2018-09-20 18:54:39,785 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to bootstrap configured resource subsystems!
|
||||
|
@ -62,7 +62,7 @@ chown :yarn -R /sys/fs/cgroup/cpu,cpuacct
|
|||
chmod g+rwx -R /sys/fs/cgroup/cpu,cpuacct
|
||||
```
|
||||
|
||||
If GPUs are used,the access to cgroup devices folder is neede as well
|
||||
If GPUs are used, access to cgroup devices folder is required as well.
|
||||
|
||||
```
|
||||
chown :yarn -R /sys/fs/cgroup/devices
|
||||
|
@ -140,7 +140,7 @@ $ chmod +x find-busy-mnt.sh
|
|||
$ kill -9 5007
|
||||
```
|
||||
|
||||
### Issue 5:Yarn failed to start containers
|
||||
### Issue 5:YARN fails to start containers
|
||||
|
||||
if the number of GPUs required by applications is larger than the number of GPUs in the cluster, there would be some containers can't be created.
|
||||
If the number of GPUs required by an application is greater than the number of GPUs in the cluster, some container will not be created.
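A quick arithmetic check before submitting can catch this case. The sketch below assumes the job requests `num_workers` times the `gpu=` value from `--worker_resources`, ignores any GPUs requested for other roles, and expects you to supply the cluster total yourself; `gpus_fit` is a hypothetical helper.

```shell
# Hypothetical helper: succeed only if workers ($1) * GPUs per worker ($2)
# does not exceed the total number of GPUs in the cluster ($3).
gpus_fit() {
  requested=$(( $1 * $2 ))
  [ "$requested" -le "$3" ]
}

# Example: 2 workers with gpu=1 each on a cluster assumed to have 4 GPUs.
if gpus_fit 2 1 4; then
  echo "request fits"
else
  echo "request exceeds cluster GPUs"
fi
```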
|
||||
|
||||
|
|