SUBMARINE-56. Update documentation to describe single-node PyTorch integration. Contributed by Szilard Nemeth.

Sunil G 2019-05-15 21:26:48 -07:00
parent d4c8858586
commit de01422c2e
10 changed files with 203 additions and 16 deletions

@ -14,7 +14,7 @@
# Developer Guide
By default, submarine uses YARN service framework as runtime. If you want to add your own implementation. You can add a new `RuntimeFactory` implementation and configure following option to `submarine.xml` (which should be placed under same `$HADOOP_CONF_DIR`)
By default, Submarine uses the YARN service framework as its runtime. If you want to add your own implementation, you can add a new `RuntimeFactory` implementation and configure the following option in `submarine.xml` (which should be placed under the same `$HADOOP_CONF_DIR`):
```
<property>

@ -18,4 +18,6 @@ Here're some examples about Submarine usage.
[Running Distributed CIFAR 10 Tensorflow Job](RunningDistributedCifar10TFJobs.html)
[Running Standalone CIFAR 10 PyTorch Job](RunningSingleNodeCifar10PTJobs.html)
[Running Zeppelin Notebook on YARN](RunningZeppelinOnYARN.html)

@ -22,6 +22,8 @@ Goals of Submarine:
- Support running distributed Tensorflow jobs with simple configs.
- Support running standalone PyTorch jobs with simple configs.
- Support running user-specified Docker images.
- Support specifying GPU and other resources.
@ -37,7 +39,9 @@ Click below contents if you want to understand more.
- [Examples](Examples.html)
- [How to write Dockerfile for Submarine jobs](WriteDockerfile.html)
- [How to write Dockerfile for Submarine TensorFlow jobs](WriteDockerfileTF.html)
- [How to write Dockerfile for Submarine PyTorch jobs](WriteDockerfilePT.html)
- [Developer guide](DeveloperGuide.html)

@ -304,7 +304,7 @@ https://github.com/NVIDIA/nvidia-docker
### Tensorflow Image
There is no need to install CUDNN and CUDA on the servers, because CUDNN and CUDA can be added in the docker images. we can get basic docker images by referring to WriteDockerfile.md.
There is no need to install CUDNN and CUDA on the servers, because CUDNN and CUDA can be added in the docker images. We can get basic docker images by referring to [Write Dockerfile](WriteDockerfileTF.html).
### Test tensorflow in a docker container

@ -293,7 +293,7 @@ https://github.com/NVIDIA/nvidia-docker
### Tensorflow Image
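CUDNN and CUDA do not actually need to be installed on the physical machines, because Sumbmarine provides images that already include CUDNN and CUDA. For the base Dockerfile, see WriteDockerfile.md
CUDNN and CUDA do not actually need to be installed on the physical machines, because Submarine provides images that already include CUDNN and CUDA. For the base Dockerfile, see WriteDockerfileTF.md
### Test the TF environment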

@ -24,15 +24,18 @@ Optional:
- Enable YARN DNS. (When yarn service runtime is required.)
- Enable GPU on YARN support. (When GPU-based training is required.)
- Docker images for submarine jobs. (When docker container is required.)
- Docker images for Submarine jobs. (When docker container is required.)
```
# Get prebuilt docker images (No liability)
docker pull hadoopsubmarine/tf-1.13.1-gpu:0.0.1
# Or build your own docker images
docker build . -f Dockerfile.gpu.tf_1.13.1 -t tf-1.13.1-gpu-base:0.0.1
```
More details, please refer to
[How to write Dockerfile for Submarine jobs](WriteDockerfile.html)
For more details, please refer to:
- [How to write Dockerfile for Submarine TensorFlow jobs](WriteDockerfileTF.html)
- [How to write Dockerfile for Submarine PyTorch jobs](WriteDockerfilePT.html)
## Run jobs
@ -120,7 +123,7 @@ reported from `entry_script.py`.
### Submarine Configuration
For submarine internal configuration, please create a `submarine.xml` which should be placed under `$HADOOP_CONF_DIR`.
For Submarine internal configuration, please create a `submarine.xml` which should be placed under `$HADOOP_CONF_DIR`.
|Configuration Name | Description |
|:---- |:---- |
@ -235,7 +238,7 @@ Or you can use `yarn logs -applicationId <applicationId>` to get logs from CLI
## Build from source code
If you want to build submarine project by yourself, you can follow the steps:
If you want to build the Submarine project by yourself, you can follow the steps:
- Run 'mvn install -DskipTests' from Hadoop source top level once.

@ -39,9 +39,9 @@ python generate_cifar10_tfrecords.py --data-dir=cifar-10-data
hadoop fs -put cifar-10-data/ /dataset/cifar-10-data
```
**Please note that:**
**Warning:**
YARN service doesn't allow multiple services with the same name, so please run following command
Please note that the YARN service framework doesn't allow multiple services with the same name, so please run the following command
```
yarn application -destroy <service-name>
```
@ -49,7 +49,7 @@ to delete services if you want to reuse the same service name.
## Prepare Docker images
Refer to [Write Dockerfile](WriteDockerfile.md) to build a Docker image or use prebuilt one.
Refer to [Write Dockerfile](WriteDockerfileTF.html) to build a Docker image or use a prebuilt one.
## Run Tensorflow jobs
@ -92,6 +92,8 @@ Explanations:
- `>1` num_workers indicates it is a distributed training.
- Parameters / resources / Docker image of parameter server can be specified separately. For many cases, parameter server doesn't require GPU.
For the meaning of the individual parameters, see the [QuickStart](QuickStart.html) page!
*Outputs of distributed training*
Sample output of master:

@ -0,0 +1,62 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Tutorial: Running a standalone Cifar10 PyTorch Estimator Example.
Currently, PyTorch integration with Submarine only supports PyTorch in standalone (non-distributed) mode.
Please also note that HDFS as a data source is not yet supported by PyTorch.
## What is CIFAR-10?
CIFAR-10 is a common benchmark in machine learning for image recognition. The example below is based on the CIFAR-10 dataset.
**Warning:**
Please note that the YARN service framework doesn't allow multiple services with the same name, so please run the following command
```
yarn application -destroy <service-name>
```
to delete services if you want to reuse the same service name.
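The cleanup step above can be scripted so that it runs before each resubmission. The following is a minimal sketch, not part of Submarine: `SERVICE_NAME` and `DRY_RUN` are illustrative variables, and by default the script only prints the command so it can be tried without a cluster.

```shell
#!/bin/sh
# Sketch: free a YARN service name before reusing it.
# SERVICE_NAME and DRY_RUN are hypothetical helper variables, not Submarine options.
SERVICE_NAME="${SERVICE_NAME:-pytorch-job-001}"
DRY_RUN="${DRY_RUN:-1}"

destroy_service() {
  # $1 is the name of the service to destroy.
  if [ "$DRY_RUN" = "1" ]; then
    # Print the command instead of running it.
    printf '%s\n' "yarn application -destroy $1"
  else
    yarn application -destroy "$1"
  fi
}

destroy_service "$SERVICE_NAME"
```

Set `DRY_RUN=0` to actually destroy the service on a host where the `yarn` CLI is configured.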
## Prepare Docker images
Refer to [Write Dockerfile](WriteDockerfilePT.html) to build a Docker image or use a prebuilt one.
## Running PyTorch jobs
### Run standalone training
```
export HADOOP_CLASSPATH="/home/systest/hadoop-submarine-score-yarnservice-runtime-0.2.0-SNAPSHOT.jar:/home/systest/hadoop-submarine-core-0.2.0-SNAPSHOT.jar"
/opt/hadoop/bin/yarn jar /home/systest/hadoop-submarine-core-0.2.0-SNAPSHOT.jar job run \
  --name pytorch-job-001 \
  --verbose \
  --framework pytorch \
  --wait_job_finish \
  --docker_image pytorch-latest-gpu:0.0.1 \
  --input_path hdfs://unused \
  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre \
  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.2 \
  --env YARN_CONTAINER_RUNTIME_DOCKER_DELAYED_REMOVAL=true \
  --num_workers 1 \
  --worker_resources memory=5G,vcores=2 \
  --worker_launch_cmd "cd /test/ && python cifar10_tutorial.py"
```
For the meaning of the individual parameters, see the [QuickStart](QuickStart.html) page!
**Remarks:**
Please note that the input path parameter is mandatory, but not yet used by the PyTorch docker container.
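For repeated runs it can help to keep the job name, image, and jar path in variables. The sketch below is one possible wrapper, assuming `HADOOP_CLASSPATH` is already exported as shown above and that `yarn` is on the `PATH`; `JOB_NAME`, `IMAGE`, `CORE_JAR`, `DRY_RUN`, and `run_or_print` are illustrative names, while the flags are the ones used in the command above.

```shell
#!/bin/sh
# Sketch: submit the standalone PyTorch job with a few values factored out.
# All variable names and the run_or_print helper are hypothetical.
JOB_NAME="${JOB_NAME:-pytorch-job-001}"
IMAGE="${IMAGE:-pytorch-latest-gpu:0.0.1}"
CORE_JAR="${CORE_JAR:-/home/systest/hadoop-submarine-core-0.2.0-SNAPSHOT.jar}"
DRY_RUN="${DRY_RUN:-1}"

run_or_print() {
  if [ "$DRY_RUN" = "1" ]; then
    # Print the command (argument quoting is not preserved in this display).
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# --input_path is mandatory but not yet used by the PyTorch container.
run_or_print yarn jar "$CORE_JAR" job run \
  --name "$JOB_NAME" \
  --verbose \
  --framework pytorch \
  --wait_job_finish \
  --docker_image "$IMAGE" \
  --input_path hdfs://unused \
  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre \
  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.2 \
  --env YARN_CONTAINER_RUNTIME_DOCKER_DELAYED_REMOVAL=true \
  --num_workers 1 \
  --worker_resources memory=5G,vcores=2 \
  --worker_launch_cmd "cd /test/ && python cifar10_tutorial.py"
```

With the default `DRY_RUN=1` the script only prints the assembled command; set `DRY_RUN=0` to submit the job.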

@ -0,0 +1,114 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Creating Docker Images for Running PyTorch on YARN
## How to create docker images to run PyTorch on YARN
A Dockerfile to run PyTorch on YARN needs two parts:
**Base libraries which PyTorch depends on**
1) OS base image, for example ```ubuntu:16.04```
2) PyTorch dependent libraries and packages. For example ```python```, ```scipy```. For GPU support, you also need ```cuda```, ```cudnn```, etc.
3) PyTorch package.
**Libraries to access HDFS**
1) JDK
2) Hadoop
Here's an example of a base image (with GPU support) to install PyTorch:
```
FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
ARG PYTHON_VERSION=3.6
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        cmake \
        git \
        curl \
        vim \
        ca-certificates \
        libjpeg-dev \
        libpng-dev \
        wget && \
    rm -rf /var/lib/apt/lists/*
RUN curl -o ~/miniconda.sh -O https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    chmod +x ~/miniconda.sh && \
    ~/miniconda.sh -b -p /opt/conda && \
    rm ~/miniconda.sh && \
    /opt/conda/bin/conda install -y python=$PYTHON_VERSION numpy pyyaml scipy ipython mkl mkl-include cython typing && \
    /opt/conda/bin/conda install -y -c pytorch magma-cuda100 && \
    /opt/conda/bin/conda clean -ya
ENV PATH /opt/conda/bin:$PATH
RUN pip install ninja
# This must be done before pip so that requirements.txt is available
WORKDIR /opt/pytorch
RUN git clone https://github.com/pytorch/pytorch.git
WORKDIR pytorch
RUN git submodule update --init
RUN TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1 7.0+PTX" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
    CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
    pip install -v .
WORKDIR /opt/pytorch
RUN git clone https://github.com/pytorch/vision.git && cd vision && pip install -v .
```
On top of the above image, add files and install packages to access HDFS:
```
RUN apt-get update && apt-get install -y openjdk-8-jdk wget
# Install hadoop
ENV HADOOP_VERSION="3.1.2"
RUN wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz
RUN tar zxf hadoop-${HADOOP_VERSION}.tar.gz
RUN ln -s hadoop-${HADOOP_VERSION} hadoop-current
RUN rm hadoop-${HADOOP_VERSION}.tar.gz
```
Build and push to your own docker registry: Use ```docker build ... ``` and ```docker push ...``` to finish this step.
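As a rough illustration of the build-and-push step, the sketch below tags the image for a private registry and pushes it. `registry.example.com`, the image name, the Dockerfile name, and the `image_ref` / `run_or_print` helpers are all placeholders chosen for this example; substitute your own values. By default it only prints the `docker` commands.

```shell
#!/bin/sh
# Sketch: build an image from a Dockerfile and push it to your own registry.
# REGISTRY, IMAGE_NAME, IMAGE_TAG, DOCKERFILE and DRY_RUN are hypothetical names.
REGISTRY="${REGISTRY:-registry.example.com}"
IMAGE_NAME="${IMAGE_NAME:-pytorch-latest-gpu-base}"
IMAGE_TAG="${IMAGE_TAG:-0.0.1}"
DOCKERFILE="${DOCKERFILE:-Dockerfile.gpu.pytorch_latest}"
DRY_RUN="${DRY_RUN:-1}"

image_ref() {
  # Full reference in the form <registry>/<name>:<tag>.
  printf '%s/%s:%s\n' "$REGISTRY" "$IMAGE_NAME" "$IMAGE_TAG"
}

run_or_print() {
  if [ "$DRY_RUN" = "1" ]; then
    # Print the command instead of running it.
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Build from the current directory, then push the tagged image.
run_or_print docker build . -f "$DOCKERFILE" -t "$(image_ref)"
run_or_print docker push "$(image_ref)"
```

Set `DRY_RUN=0` on a host with Docker installed and registry credentials configured to perform the actual build and push.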
## Use examples to build your own PyTorch docker images
We provided some example Dockerfiles for you to build your own PyTorch docker images.
For the latest PyTorch:
- *docker/pytorch/base/ubuntu-16.04/Dockerfile.gpu.pytorch_latest*: Latest PyTorch with GPU support, prebuilt for CUDA 10.
- *docker/pytorch/with-cifar10-models/ubuntu-16.04/Dockerfile.gpu.pytorch_latest*: Latest PyTorch with GPU support, prebuilt for CUDA 10, with models.
## Build Docker images
### Manually build Docker image:
Under the `docker/pytorch` directory, run `build-all.sh` to build all Docker images. This command builds the following Docker images:
- `pytorch-latest-gpu-base:0.0.1`: the base Docker image, which includes Hadoop, PyTorch, and the GPU base libraries.
- `pytorch-latest-gpu:0.0.1`: the base image plus the cifar10 model.
### Use prebuilt images
(No liability)
You can also use prebuilt images for convenience:
- hadoopsubmarine/pytorch-latest-gpu-base:0.0.1

@ -98,10 +98,10 @@ We provided following examples for you to build tensorflow docker images.
For Tensorflow 1.13.1 (Precompiled to CUDA 10.x)
- *docker/base/ubuntu-16.04/Dockerfile.cpu.tf_1.13.1*: Tensorflow 1.13.1 supports CPU only.
- *docker/with-cifar10-models/ubuntu-16.04/Dockerfile.cpu.tf_1.13.1*: Tensorflow 1.13.1 supports CPU only, and included models
- *docker/base/ubuntu-16.04/Dockerfile.gpu.tf_1.13.1*: Tensorflow 1.13.1 supports GPU, which is prebuilt to CUDA10.
- *docker/with-cifar10-models/ubuntu-16.04/Dockerfile.gpu.tf_1.13.1*: Tensorflow 1.13.1 supports GPU, which is prebuilt to CUDA10, with models.
- *docker/tensorflow/base/ubuntu-16.04/Dockerfile.cpu.tf_1.13.1*: Tensorflow 1.13.1 supports CPU only.
- *docker/tensorflow/with-cifar10-models/ubuntu-16.04/Dockerfile.cpu.tf_1.13.1*: Tensorflow 1.13.1 supports CPU only, and included models
- *docker/tensorflow/base/ubuntu-16.04/Dockerfile.gpu.tf_1.13.1*: Tensorflow 1.13.1 supports GPU, which is prebuilt to CUDA10.
- *docker/tensorflow/with-cifar10-models/ubuntu-16.04/Dockerfile.gpu.tf_1.13.1*: Tensorflow 1.13.1 supports GPU, which is prebuilt to CUDA10, with models.
## Build Docker images