SUBMARINE-56. Update documentation to describe single-node PyTorch integration. Contributed by Szilard Nemeth.
parent
d4c8858586
commit
de01422c2e
|
@ -14,7 +14,7 @@
|
||||||
|
|
||||||
# Developer Guide
|
# Developer Guide
|
||||||
|
|
||||||
By default, submarine uses YARN service framework as runtime. If you want to add your own implementation. You can add a new `RuntimeFactory` implementation and configure following option to `submarine.xml` (which should be placed under same `$HADOOP_CONF_DIR`)
|
By default, Submarine uses YARN service framework as runtime. If you want to add your own implementation, you can add a new `RuntimeFactory` implementation and configure following option to `submarine.xml` (which should be placed under same `$HADOOP_CONF_DIR`)
|
||||||
|
|
||||||
```
|
```
|
||||||
<property>
|
<property>
|
||||||
|
|
|
@ -18,4 +18,6 @@ Here're some examples about Submarine usage.
|
||||||
|
|
||||||
[Running Distributed CIFAR 10 Tensorflow Job](RunningDistributedCifar10TFJobs.html)
|
[Running Distributed CIFAR 10 Tensorflow Job](RunningDistributedCifar10TFJobs.html)
|
||||||
|
|
||||||
|
[Running Standalone CIFAR 10 PyTorch Job](RunningSingleNodeCifar10PTJobs.html)
|
||||||
|
|
||||||
[Running Zeppelin Notebook on YARN](RunningZeppelinOnYARN.html)
|
[Running Zeppelin Notebook on YARN](RunningZeppelinOnYARN.html)
|
|
@ -22,6 +22,8 @@ Goals of Submarine:
|
||||||
|
|
||||||
- Support run distributed Tensorflow jobs with simple configs.
|
- Support run distributed Tensorflow jobs with simple configs.
|
||||||
|
|
||||||
|
- Support run standalone PyTorch jobs with simple configs.
|
||||||
|
|
||||||
- Support run user-specified Docker images.
|
- Support run user-specified Docker images.
|
||||||
|
|
||||||
- Support specify GPU and other resources.
|
- Support specify GPU and other resources.
|
||||||
|
@ -37,7 +39,9 @@ Click below contents if you want to understand more.
|
||||||
|
|
||||||
- [Examples](Examples.html)
|
- [Examples](Examples.html)
|
||||||
|
|
||||||
- [How to write Dockerfile for Submarine jobs](WriteDockerfile.html)
|
- [How to write Dockerfile for Submarine TensorFlow jobs](WriteDockerfileTF.html)
|
||||||
|
|
||||||
|
- [How to write Dockerfile for Submarine PyTorch jobs](WriteDockerfilePT.html)
|
||||||
|
|
||||||
- [Developer guide](DeveloperGuide.html)
|
- [Developer guide](DeveloperGuide.html)
|
||||||
|
|
||||||
|
|
|
@ -304,7 +304,7 @@ https://github.com/NVIDIA/nvidia-docker
|
||||||
|
|
||||||
### Tensorflow Image
|
### Tensorflow Image
|
||||||
|
|
||||||
There is no need to install CUDNN and CUDA on the servers, because CUDNN and CUDA can be added in the docker images. we can get basic docker images by referring to WriteDockerfile.md.
|
There is no need to install CUDNN and CUDA on the servers, because CUDNN and CUDA can be added in the docker images. We can get basic docker images by referring to [Write Dockerfile](WriteDockerfileTF.html).
|
||||||
|
|
||||||
### Test tensorflow in a docker container
|
### Test tensorflow in a docker container
|
||||||
|
|
||||||
|
|
|
@ -293,7 +293,7 @@ https://github.com/NVIDIA/nvidia-docker
|
||||||
|
|
||||||
### Tensorflow Image
|
### Tensorflow Image
|
||||||
|
|
||||||
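CUDNN and CUDA do not actually need to be installed on the physical machine, because Sumbmarine provides image files that already include CUDNN and CUDA. For a basic Dockerfile, see WriteDockerfile.md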
|
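CUDNN and CUDA do not actually need to be installed on the physical machine, because Submarine provides image files that already include CUDNN and CUDA. For a basic Dockerfile, see WriteDockerfileTF.md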
|
||||||
|
|
||||||
### Test the TF environment
|
### Test the TF environment
|
||||||
|
|
||||||
|
|
|
@ -24,15 +24,18 @@ Optional:
|
||||||
|
|
||||||
- Enable YARN DNS. (When yarn service runtime is required.)
|
- Enable YARN DNS. (When yarn service runtime is required.)
|
||||||
- Enable GPU on YARN support. (When GPU-based training is required.)
|
- Enable GPU on YARN support. (When GPU-based training is required.)
|
||||||
- Docker images for submarine jobs. (When docker container is required.)
|
- Docker images for Submarine jobs. (When docker container is required.)
|
||||||
```
|
```
|
||||||
# Get prebuilt docker images (No liability)
|
# Get prebuilt docker images (No liability)
|
||||||
docker pull hadoopsubmarine/tf-1.13.1-gpu:0.0.1
|
docker pull hadoopsubmarine/tf-1.13.1-gpu:0.0.1
|
||||||
# Or build your own docker images
|
# Or build your own docker images
|
||||||
docker build . -f Dockerfile.gpu.tf_1.13.1 -t tf-1.13.1-gpu-base:0.0.1
|
docker build . -f Dockerfile.gpu.tf_1.13.1 -t tf-1.13.1-gpu-base:0.0.1
|
||||||
```
|
```
|
||||||
More details, please refer to
|
For more details, please refer to:
|
||||||
[How to write Dockerfile for Submarine jobs](WriteDockerfile.html)
|
|
||||||
|
- [How to write Dockerfile for Submarine TensorFlow jobs](WriteDockerfileTF.html)
|
||||||
|
|
||||||
|
- [How to write Dockerfile for Submarine PyTorch jobs](WriteDockerfilePT.html)
|
||||||
|
|
||||||
## Run jobs
|
## Run jobs
|
||||||
|
|
||||||
|
@ -120,7 +123,7 @@ reported from `entry_script.py`.
|
||||||
|
|
||||||
### Submarine Configuration
|
### Submarine Configuration
|
||||||
|
|
||||||
For submarine internal configuration, please create a `submarine.xml` which should be placed under `$HADOOP_CONF_DIR`.
|
For Submarine internal configuration, please create a `submarine.xml` which should be placed under `$HADOOP_CONF_DIR`.
|
||||||
|
|
||||||
|Configuration Name | Description |
|
|Configuration Name | Description |
|
||||||
|:---- |:---- |
|
|:---- |:---- |
|
||||||
|
@ -235,7 +238,7 @@ Or you can use `yarn logs -applicationId <applicationId>` to get logs from CLI
|
||||||
|
|
||||||
## Build from source code
|
## Build from source code
|
||||||
|
|
||||||
If you want to build submarine project by yourself, you can follow the steps:
|
If you want to build the Submarine project by yourself, you can follow the steps:
|
||||||
|
|
||||||
- Run 'mvn install -DskipTests' from Hadoop source top level once.
|
- Run 'mvn install -DskipTests' from Hadoop source top level once.
|
||||||
|
|
||||||
|
|
|
@ -39,9 +39,9 @@ python generate_cifar10_tfrecords.py --data-dir=cifar-10-data
|
||||||
hadoop fs -put cifar-10-data/ /dataset/cifar-10-data
|
hadoop fs -put cifar-10-data/ /dataset/cifar-10-data
|
||||||
```
|
```
|
||||||
|
|
||||||
**Please note that:**
|
**Warning:**
|
||||||
|
|
||||||
YARN service doesn't allow multiple services with the same name, so please run following command
|
Please note that YARN service doesn't allow multiple services with the same name, so please run following command
|
||||||
```
|
```
|
||||||
yarn application -destroy <service-name>
|
yarn application -destroy <service-name>
|
||||||
```
|
```
|
||||||
|
@ -49,7 +49,7 @@ to delete services if you want to reuse the same service name.
|
||||||
|
|
||||||
## Prepare Docker images
|
## Prepare Docker images
|
||||||
|
|
||||||
Refer to [Write Dockerfile](WriteDockerfile.md) to build a Docker image or use prebuilt one.
|
Refer to [Write Dockerfile](WriteDockerfileTF.html) to build a Docker image or use prebuilt one.
|
||||||
|
|
||||||
## Run Tensorflow jobs
|
## Run Tensorflow jobs
|
||||||
|
|
||||||
|
@ -92,6 +92,8 @@ Explanations:
|
||||||
- `>1` num_workers indicates it is a distributed training.
|
- `>1` num_workers indicates it is a distributed training.
|
||||||
- Parameters / resources / Docker image of parameter server can be specified separately. For many cases, parameter server doesn't require GPU.
|
- Parameters / resources / Docker image of parameter server can be specified separately. For many cases, parameter server doesn't require GPU.
|
||||||
|
|
||||||
|
For the meaning of the individual parameters, see the [QuickStart](QuickStart.html) page!
|
||||||
|
|
||||||
*Outputs of distributed training*
|
*Outputs of distributed training*
|
||||||
|
|
||||||
Sample output of master:
|
Sample output of master:
|
||||||
|
|
|
@ -0,0 +1,62 @@
|
||||||
|
<!--
|
||||||
|
Licensed to the Apache Software Foundation (ASF) under one or more
|
||||||
|
contributor license agreements. See the NOTICE file distributed with
|
||||||
|
this work for additional information regarding copyright ownership.
|
||||||
|
The ASF licenses this file to You under the Apache License, Version 2.0
|
||||||
|
(the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software
|
||||||
|
distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
See the License for the specific language governing permissions and
|
||||||
|
limitations under the License.
|
||||||
|
-->
|
||||||
|
# Tutorial: Running a standalone CIFAR-10 PyTorch Estimator Example
|
||||||
|
|
||||||
|
Currently, the PyTorch integration with Submarine only supports PyTorch in standalone (non-distributed) mode.
|
||||||
|
Please also note that HDFS as a data source is not yet supported by PyTorch.
|
||||||
|
|
||||||
|
## What is CIFAR-10?
|
||||||
|
CIFAR-10 is a common benchmark in machine learning for image recognition. The example below is based on the CIFAR-10 dataset.
|
||||||
|
|
||||||
|
**Warning:**
|
||||||
|
|
||||||
|
Please note that the YARN service framework doesn't allow multiple services with the same name, so please run the following command
|
||||||
|
```
|
||||||
|
yarn application -destroy <service-name>
|
||||||
|
```
|
||||||
|
to delete services if you want to reuse the same service name.
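
The cleanup step above can be wrapped in a small helper. This is a minimal sketch, not part of Submarine: `destroy_service` and the `DRY_RUN` variable are hypothetical names introduced here, and only the `yarn application -destroy` command itself comes from this page. With `DRY_RUN` left at its default of `1`, the helper only prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Submarine): frees a YARN service name
# so that the same name can be reused for the next job.
# DRY_RUN=1 (the default) only prints the command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

destroy_service() {
  service_name="$1"
  if [ -z "$service_name" ]; then
    echo "usage: destroy_service <service-name>" >&2
    return 1
  fi
  if [ "$DRY_RUN" = "1" ]; then
    echo "yarn application -destroy $service_name"
  else
    yarn application -destroy "$service_name"
  fi
}

# Example: the job name used for the standalone training run in this tutorial.
destroy_service pytorch-job-001
```

Set `DRY_RUN=0` once the printed command looks right for your cluster.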
|
||||||
|
|
||||||
|
## Prepare Docker images
|
||||||
|
|
||||||
|
Refer to [Write Dockerfile](WriteDockerfilePT.html) to build a Docker image or use a prebuilt one.
|
||||||
|
|
||||||
|
## Running PyTorch jobs
|
||||||
|
|
||||||
|
### Run standalone training
|
||||||
|
|
||||||
|
```
|
||||||
|
export HADOOP_CLASSPATH="/home/systest/hadoop-submarine-score-yarnservice-runtime-0.2.0-SNAPSHOT.jar:/home/systest/hadoop-submarine-core-0.2.0-SNAPSHOT.jar"
|
||||||
|
/opt/hadoop/bin/yarn jar /home/systest/hadoop-submarine-core-0.2.0-SNAPSHOT.jar job run \
|
||||||
|
--name pytorch-job-001 \
|
||||||
|
--verbose \
|
||||||
|
--framework pytorch \
|
||||||
|
--wait_job_finish \
|
||||||
|
--docker_image pytorch-latest-gpu:0.0.1 \
|
||||||
|
--input_path hdfs://unused \
|
||||||
|
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre \
|
||||||
|
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.2 \
|
||||||
|
--env YARN_CONTAINER_RUNTIME_DOCKER_DELAYED_REMOVAL=true \
|
||||||
|
--num_workers 1 \
|
||||||
|
--worker_resources memory=5G,vcores=2 \
|
||||||
|
--worker_launch_cmd "cd /test/ && python cifar10_tutorial.py"
|
||||||
|
|
||||||
|
```
|
||||||
|
|
||||||
|
For the meaning of the individual parameters, see the [QuickStart](QuickStart.html) page!
|
||||||
|
|
||||||
|
**Remarks:**
|
||||||
|
Please note that the input path parameter is mandatory, but not yet used by the PyTorch docker container.
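
The command above can be parameterized so that the job name, image, and launch command are easy to change between runs. The following is a rough sketch under stated assumptions: `run_pytorch_job` and the variables `SUBMARINE_JAR`, `JOB_NAME`, `DOCKER_IMAGE`, `LAUNCH_CMD`, and `DRY_RUN` are hypothetical names, not Submarine settings, and every flag is copied from the command shown above. `HADOOP_CLASSPATH` still has to be exported as shown there before a real run.

```shell
#!/bin/sh
# Hypothetical wrapper around the `job run` command from this tutorial.
# All variable names below are illustrative, not Submarine settings.
SUBMARINE_JAR="${SUBMARINE_JAR:-/home/systest/hadoop-submarine-core-0.2.0-SNAPSHOT.jar}"
JOB_NAME="${JOB_NAME:-pytorch-job-001}"
DOCKER_IMAGE="${DOCKER_IMAGE:-pytorch-latest-gpu:0.0.1}"
LAUNCH_CMD="${LAUNCH_CMD:-cd /test/ && python cifar10_tutorial.py}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the arguments, 0 = submit the job

run_pytorch_job() {
  # --input_path is mandatory but not used by the PyTorch container, so a
  # placeholder is passed. --num_workers stays at 1 because only
  # standalone (non-distributed) training is supported.
  set -- /opt/hadoop/bin/yarn jar "$SUBMARINE_JAR" job run \
    --name "$JOB_NAME" \
    --verbose \
    --framework pytorch \
    --wait_job_finish \
    --docker_image "$DOCKER_IMAGE" \
    --input_path hdfs://unused \
    --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre \
    --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.2 \
    --env YARN_CONTAINER_RUNTIME_DOCKER_DELAYED_REMOVAL=true \
    --num_workers 1 \
    --worker_resources memory=5G,vcores=2 \
    --worker_launch_cmd "$LAUNCH_CMD"
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$@"   # one argument per line, so quoting is visible
  else
    "$@"
  fi
}

run_pytorch_job
```

Building the argument list with `set --` keeps the launch command as a single argument, which avoids quoting problems with the embedded `&&`.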
|
|
@ -0,0 +1,114 @@
|
||||||
|
<!--
|
||||||
|
Licensed to the Apache Software Foundation (ASF) under one or more
|
||||||
|
contributor license agreements. See the NOTICE file distributed with
|
||||||
|
this work for additional information regarding copyright ownership.
|
||||||
|
The ASF licenses this file to You under the Apache License, Version 2.0
|
||||||
|
(the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software
|
||||||
|
distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
See the License for the specific language governing permissions and
|
||||||
|
limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# Creating Docker Images for Running PyTorch on YARN
|
||||||
|
|
||||||
|
## How to create docker images to run PyTorch on YARN
|
||||||
|
|
||||||
|
A Dockerfile to run PyTorch on YARN needs two parts:
|
||||||
|
|
||||||
|
**Base libraries which PyTorch depends on**
|
||||||
|
|
||||||
|
1) OS base image, for example ```ubuntu:16.04```
|
||||||
|
|
||||||
|
2) Libraries and packages that PyTorch depends on, for example ```python``` and ```scipy```. For GPU support, you also need ```cuda```, ```cudnn```, etc.
|
||||||
|
|
||||||
|
3) PyTorch package.
|
||||||
|
|
||||||
|
**Libraries to access HDFS**
|
||||||
|
|
||||||
|
1) JDK
|
||||||
|
|
||||||
|
2) Hadoop
|
||||||
|
|
||||||
|
Here's an example of a base image (with GPU support) to install PyTorch:
|
||||||
|
```
|
||||||
|
FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
|
||||||
|
ARG PYTHON_VERSION=3.6
|
||||||
|
RUN apt-get update && apt-get install -y --no-install-recommends \
|
||||||
|
build-essential \
|
||||||
|
cmake \
|
||||||
|
git \
|
||||||
|
curl \
|
||||||
|
vim \
|
||||||
|
ca-certificates \
|
||||||
|
libjpeg-dev \
|
||||||
|
libpng-dev \
|
||||||
|
wget &&\
|
||||||
|
rm -rf /var/lib/apt/lists/*
|
||||||
|
|
||||||
|
|
||||||
|
RUN curl -o ~/miniconda.sh -O https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
|
||||||
|
chmod +x ~/miniconda.sh && \
|
||||||
|
~/miniconda.sh -b -p /opt/conda && \
|
||||||
|
rm ~/miniconda.sh && \
|
||||||
|
/opt/conda/bin/conda install -y python=$PYTHON_VERSION numpy pyyaml scipy ipython mkl mkl-include cython typing && \
|
||||||
|
/opt/conda/bin/conda install -y -c pytorch magma-cuda100 && \
|
||||||
|
/opt/conda/bin/conda clean -ya
|
||||||
|
ENV PATH /opt/conda/bin:$PATH
|
||||||
|
RUN pip install ninja
|
||||||
|
# This must be done before pip so that requirements.txt is available
|
||||||
|
WORKDIR /opt/pytorch
|
||||||
|
RUN git clone https://github.com/pytorch/pytorch.git
|
||||||
|
WORKDIR pytorch
|
||||||
|
RUN git submodule update --init
|
||||||
|
RUN TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1 7.0+PTX" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
|
||||||
|
CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
|
||||||
|
pip install -v .
|
||||||
|
|
||||||
|
WORKDIR /opt/pytorch
|
||||||
|
RUN git clone https://github.com/pytorch/vision.git && cd vision && pip install -v .
|
||||||
|
|
||||||
|
```
|
||||||
|
|
||||||
|
On top of the above image, add files and install packages to access HDFS:
|
||||||
|
```
|
||||||
|
RUN apt-get update && apt-get install -y openjdk-8-jdk wget
|
||||||
|
# Install hadoop
|
||||||
|
ENV HADOOP_VERSION="3.1.2"
|
||||||
|
RUN wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz
|
||||||
|
RUN tar zxf hadoop-${HADOOP_VERSION}.tar.gz
|
||||||
|
RUN ln -s hadoop-${HADOOP_VERSION} hadoop-current
|
||||||
|
RUN rm hadoop-${HADOOP_VERSION}.tar.gz
|
||||||
|
```
|
||||||
|
|
||||||
|
Build the image and push it to your own Docker registry: use ```docker build ...``` and ```docker push ...``` to finish this step.
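
As an illustration of the build-and-push step, here is a minimal sketch. The registry host `registry.example.com`, the helper names `run` and `build_and_push`, and the `DRY_RUN` switch are assumptions made for this example; the Dockerfile and image names are the ones used elsewhere on this page. By default the script only prints the `docker` commands.

```shell
#!/bin/sh
# Hypothetical build-and-push helper. REGISTRY, IMAGE, DOCKERFILE and DRY_RUN
# are illustrative; replace registry.example.com with your own registry.
REGISTRY="${REGISTRY:-registry.example.com}"
IMAGE="${IMAGE:-pytorch-latest-gpu-base:0.0.1}"
DOCKERFILE="${DOCKERFILE:-Dockerfile.gpu.pytorch_latest}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

build_and_push() {
  # Build locally, tag with the registry prefix, then push the tagged image.
  run docker build . -f "$DOCKERFILE" -t "$IMAGE" &&
  run docker tag "$IMAGE" "$REGISTRY/$IMAGE" &&
  run docker push "$REGISTRY/$IMAGE"
}

build_and_push
```

Chaining the steps with `&&` stops the sequence at the first failure, so a broken build is never pushed.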
|
||||||
|
|
||||||
|
## Use examples to build your own PyTorch docker images
|
||||||
|
|
||||||
|
We provided some example Dockerfiles for you to build your own PyTorch docker images.
|
||||||
|
|
||||||
|
For the latest PyTorch:
|
||||||
|
|
||||||
|
- *docker/pytorch/base/ubuntu-16.04/Dockerfile.gpu.pytorch_latest*: Latest PyTorch that supports GPU, which is prebuilt to CUDA10.
|
||||||
|
- *docker/pytorch/with-cifar10-models/ubuntu-16.04/Dockerfile.gpu.pytorch_latest*: Latest PyTorch that supports GPU, which is prebuilt to CUDA10, with models.
|
||||||
|
|
||||||
|
## Build Docker images
|
||||||
|
|
||||||
|
### Manually build Docker image:
|
||||||
|
|
||||||
|
Under the `docker/pytorch` directory, run `build-all.sh` to build all Docker images. This command will build the following Docker images:
|
||||||
|
|
||||||
|
- `pytorch-latest-gpu-base:0.0.1`, the base Docker image, which includes Hadoop, PyTorch, and GPU base libraries.
|
||||||
|
- `pytorch-latest-gpu:0.0.1`, which also includes the CIFAR-10 model.
|
||||||
|
|
||||||
|
### Use prebuilt images
|
||||||
|
|
||||||
|
(No liability)
|
||||||
|
You can also use prebuilt images for convenience:
|
||||||
|
|
||||||
|
- hadoopsubmarine/pytorch-latest-gpu-base:0.0.1
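
Pulling the prebuilt image might look like the sketch below. The retagging step is an assumption: it gives the image the local name that `build-all.sh` would have produced, which is only useful if other Dockerfiles or commands refer to that name. `use_prebuilt`, `run`, `LOCAL_NAME`, and `DRY_RUN` are hypothetical names, and by default the commands are printed rather than executed.

```shell
#!/bin/sh
# Hypothetical helper: pull the prebuilt base image and give it the local
# name that build-all.sh would have produced (the retagging is an assumption).
PREBUILT_IMAGE="hadoopsubmarine/pytorch-latest-gpu-base:0.0.1"
LOCAL_NAME="${PREBUILT_IMAGE#hadoopsubmarine/}"   # strip the repository prefix
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

use_prebuilt() {
  run docker pull "$PREBUILT_IMAGE" &&
  run docker tag "$PREBUILT_IMAGE" "$LOCAL_NAME"
}

use_prebuilt
```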
|
|
@ -98,10 +98,10 @@ We provided following examples for you to build tensorflow docker images.
|
||||||
|
|
||||||
For Tensorflow 1.13.1 (Precompiled to CUDA 10.x)
|
For Tensorflow 1.13.1 (Precompiled to CUDA 10.x)
|
||||||
|
|
||||||
- *docker/base/ubuntu-16.04/Dockerfile.cpu.tf_1.13.1*: Tensorflow 1.13.1 supports CPU only.
|
- *docker/tensorflow/base/ubuntu-16.04/Dockerfile.cpu.tf_1.13.1*: Tensorflow 1.13.1 supports CPU only.
|
||||||
- *docker/with-cifar10-models/ubuntu-16.04/Dockerfile.cpu.tf_1.13.1*: Tensorflow 1.13.1 supports CPU only, and included models
|
- *docker/tensorflow/with-cifar10-models/ubuntu-16.04/Dockerfile.cpu.tf_1.13.1*: Tensorflow 1.13.1 supports CPU only, and included models
|
||||||
- *docker/base/ubuntu-16.04/Dockerfile.gpu.tf_1.13.1*: Tensorflow 1.13.1 supports GPU, which is prebuilt to CUDA10.
|
- *docker/tensorflow/base/ubuntu-16.04/Dockerfile.gpu.tf_1.13.1*: Tensorflow 1.13.1 supports GPU, which is prebuilt to CUDA10.
|
||||||
- *docker/with-cifar10-models/ubuntu-16.04/Dockerfile.gpu.tf_1.13.1*: Tensorflow 1.13.1 supports GPU, which is prebuilt to CUDA10, with models.
|
- *docker/tensorflow/with-cifar10-models/ubuntu-16.04/Dockerfile.gpu.tf_1.13.1*: Tensorflow 1.13.1 supports GPU, which is prebuilt to CUDA10, with models.
|
||||||
|
|
||||||
## Build Docker images
|
## Build Docker images
|
||||||
|
|