2017년 6월 8일 목요일

POWER8에서 tensorflow docker image를 build하여 Inception v3 training 수행하기

먼저 docker hub에서 전에 만들어둔 cuda8과 cudnn5-devel package가 설치된 ppc64le 기반의 Ubuntu 16.04 LTS docker image를 pull 합니다.

root@sys-87548:/home/u0017496# docker pull bsyu/cuda8-cudnn5-devel:cudnn5-devel
cudnn5-devel: Pulling from bsyu/cuda8-cudnn5-devel
ffa99da61f7b: Extracting 41.78 MB/72.3 MB
6b239e02a89e: Download complete
aecbc9abccdc: Downloading 110.8 MB/415.3 MB
8f458a3f0497: Download complete
4903f7ce6675: Download complete
0c588ac98d19: Downloading   107 MB/450.9 MB
12e624e884fc: Download complete
18dd28bbb571: Downloading 45.37 MB/103.2 MB
...

이를 기반으로 PowerAI에 포함된 tensorflow를 설치한 docker image를 build 합니다.  먼저 dockerfile을 다음과 같이 만듭니다.

root@sys-87548:/home/u0017496# vi dockerfile.tensorflow

FROM bsyu/cuda8-cudnn5-devel:cudnn5-devel
RUN apt-get update && apt-get install -y nvidia-modprobe

RUN mkdir /tmp/temp
COPY libcudnn5* /tmp/temp/
COPY cuda-repo* /tmp/temp/
COPY mldl-repo* /tmp/temp/

RUN dpkg -i /tmp/temp/cuda-repo*deb && \
    dpkg -i /tmp/temp/libcudnn5*deb && \
    dpkg -i /tmp/temp/mldl-repo*deb && \
    rm -rf /tmp/temp && \
    apt-get update && apt-get install -y tensorflow && \
    rm -rf /var/lib/apt/lists/* && \
    dpkg -r mldl-repo-local

# set the working directory
WORKDIR /opt/DL/caffe-nv/bin
ENV LD_LIBRARY_PATH="/usr/local/nvidia/lib64:/usr/local/cuda-8.0/targets/ppc64le-linux/lib/stubs:/usr/lib/powerpc64le-linux-gnu/stubs:/usr/lib/powerpc64le-linux-gnu:/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:/opt/DL/tensorflow/lib:/usr/lib:/usr/local/lib"
ENV PATH="/opt/ibm/xlC/current/bin:/opt/ibm/xlf/current/bin:/opt/at10.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/opt/DL/bazel/bin:/opt/DL/tensorflow/bin"
ENV PYTHONPATH="/opt/DL/tensorflow/lib/python2.7/site-packages"

CMD /bin/bash

이제 이 dockerfile을 기반으로 bsyu/tensor_r1.0:ppc64le-xenial의 docker image를 빌드합니다.

root@sys-87548:/home/u0017496# docker build -t bsyu/tensor_r1.0:ppc64le-xenial -f dockerfile.tensorflow .
Sending build context to Docker daemon 3.436 GB
Step 1 : FROM bsyu/cuda8-cudnn5-devel:cudnn5-devel
 ---> d8d0da2fbdf2
Step 2 : RUN apt-get update && apt-get install -y nvidia-modprobe
 ---> Running in 204fe4e2c5f6
Ign:1 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el  InRelease
Get:2 http://ports.ubuntu.com/ubuntu-ports xenial InRelease [247 kB]
Get:3 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el  Release [565 B]
Get:4 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el  Release.gpg [819 B]
Get:5 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el  Packages [24.9 kB]
Get:6 http://ports.ubuntu.com/ubuntu-ports xenial-updates InRelease [102 kB]
Get:7 http://ports.ubuntu.com/ubuntu-ports xenial-security InRelease [102 kB]
Get:8 http://ports.ubuntu.com/ubuntu-ports xenial/main ppc64el Packages [1470 kB]
Get:9 http://ports.ubuntu.com/ubuntu-ports xenial/universe ppc64el Packages [9485 kB]
Get:10 http://ports.ubuntu.com/ubuntu-ports xenial/multiverse ppc64el Packages [152 kB]
Get:11 http://ports.ubuntu.com/ubuntu-ports xenial-updates/main ppc64el Packages [613 kB]
Get:12 http://ports.ubuntu.com/ubuntu-ports xenial-updates/universe ppc64el Packages [528 kB]
Get:13 http://ports.ubuntu.com/ubuntu-ports xenial-updates/multiverse ppc64el Packages [5465 B]
Get:14 http://ports.ubuntu.com/ubuntu-ports xenial-security/main ppc64el Packages [286 kB]
Get:15 http://ports.ubuntu.com/ubuntu-ports xenial-security/universe ppc64el Packages [138 kB]
Fetched 13.2 MB in 10s (1230 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
The following NEW packages will be installed:
  nvidia-modprobe
0 upgraded, 1 newly installed, 0 to remove and 83 not upgraded.
Need to get 16.3 kB of archives.
After this operation, 85.0 kB of additional disk space will be used.
Get:1 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/ppc64el  nvidia-modprobe 375.51-0ubuntu1 [16.3 kB]
debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin:
Fetched 16.3 kB in 0s (191 kB/s)
Selecting previously unselected package nvidia-modprobe.
(Reading database ... 17174 files and directories currently installed.)
Preparing to unpack .../nvidia-modprobe_375.51-0ubuntu1_ppc64el.deb ...
Unpacking nvidia-modprobe (375.51-0ubuntu1) ...
Setting up nvidia-modprobe (375.51-0ubuntu1) ...
 ---> 5411319bbc05
Removing intermediate container 204fe4e2c5f6
Step 3 : RUN mkdir /tmp/temp
 ---> Running in cf13b03845f1
 ---> 66b2b250777f
Removing intermediate container cf13b03845f1
Step 4 : COPY libcudnn5* /tmp/temp/
 ---> 16d921e53451
Removing intermediate container 9d1efa9ed269
Step 5 : COPY cuda-repo* /tmp/temp/
...
Step 9 : ENV LD_LIBRARY_PATH "/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:/opt/DL/tensorflow/lib:/usr/lib:/usr/local/lib"
 ---> Running in fe30af7c944e
 ---> f5faa1760ac7
Removing intermediate container fe30af7c944e
Step 10 : ENV PATH "/opt/ibm/xlC/current/bin:/opt/ibm/xlf/current/bin:/opt/at10.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/opt/DL/bazel/bin:/opt/DL/tensorflow/bin"
 ---> Running in 98a0e5bfd008
 ---> 7cfb0feaaee1
Removing intermediate container 98a0e5bfd008
Step 11 : ENV PYTHONPATH "/opt/DL/tensorflow/lib/python2.7/site-packages"
 ---> Running in d98d5352108e
 ---> affda7b26276
Removing intermediate container d98d5352108e
Step 12 : CMD /bin/bash
 ---> Running in d54a20fb7e3c
 ---> 4692368fb7ad
Removing intermediate container d54a20fb7e3c
Successfully built 4692368fb7ad


만들어진 docker image를 확인합니다.

root@sys-87548:/home/u0017496# docker images
REPOSITORY                TAG                 IMAGE ID            CREATED             SIZE
bsyu/tensor_r1.0          ppc64le-xenial      4692368fb7ad        3 minutes ago       6.448 GB
nvidia-docker             deb                 2830f66f0418        41 hours ago        429.8 MB
nvidia-docker             build               fa764787622c        41 hours ago        1.014 GB
ppc64le/ubuntu            14.04               0e6701cbf611        2 weeks ago         228.5 MB
bsyu/cuda8-cudnn5-devel   cudnn5-devel        d8d0da2fbdf2        4 months ago        1.895 GB
ppc64le/golang            1.6.3               6a579d02d32f        9 months ago        704.7 MB
golang                    1.5                 99668503de15        10 months ago       725.3 MB


이 docker image를 나중에 다른 서버에서도 사용하기 위해 docker hub에 push 해둡니다.

root@sys-87548:/home/u0017496# docker push bsyu/tensor_r1.0:ppc64le-xenial
The push refers to a repository [docker.io/bsyu/tensor_r1.0]
f42db0829239: Pushed
6a6b4d4d9d2a: Pushing 184.1 MB/2.738 GB
6458d0633f20: Pushing 172.7 MB/390.2 MB
726e25ffdf3c: Pushing 173.2 MB/1.321 GB
1535936ab54b: Pushed
bc0917851737: Pushed
9a1e25cd5998: Pushed
c0fe73e43621: Mounted from bsyu/cuda8-cudnn5-devel
4ce979019d1d: Mounted from bsyu/cuda8-cudnn5-devel
724befd94678: Mounted from bsyu/cuda8-cudnn5-devel
84f99f1bf79b: Mounted from bsyu/cuda8-cudnn5-devel
7f7c1dccec82: Mounted from bsyu/cuda8-cudnn5-devel
5b8880a35736: Mounted from bsyu/cuda8-cudnn5-devel
41b97cb9a404: Mounted from bsyu/cuda8-cudnn5-devel
08f34ce6b3fb: Mounted from bsyu/cuda8-cudnn5-devel


이제 docker image는 준비되었으니, tensorflow를 이용한 inception v3를 수행할 준비를 합니다.  inception v3 training을 수행할 python package를 bazel를 이용하여 /home/inception directory 밑에 만듭니다.  이 directory를 나중에 docker image에서 mount하여 사용할 예정입니다.

root@sys-87548:/home# mkdir inception
root@sys-87548:/home# export INCEPTION_DIR=/home/inception

root@sys-87548:/home# cd inception/
root@sys-87548:/home/inception# curl -O http://download.tensorflow.org/models/image/imagenet/inception-v3-2016-03-01.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  380M  100  380M    0     0  4205k      0  0:01:32  0:01:32 --:--:-- 4988k

root@sys-87548:/home/inception# tar -xvf inception-v3-2016-03-01.tar.gz
inception-v3/
inception-v3/checkpoint
inception-v3/README.txt
inception-v3/model.ckpt-157585

root@sys-87548:/home/inception# git clone https://github.com/tensorflow/models.git
Cloning into 'models'...
remote: Counting objects: 4866, done.
remote: Total 4866 (delta 0), reused 0 (delta 0), pack-reused 4866
Receiving objects: 100% (4866/4866), 153.36 MiB | 5.23 MiB/s, done.
Resolving deltas: 100% (2467/2467), done.
Checking connectivity... done.

root@sys-87548:/home/inception# export FLOWERS_DIR=/home/inception/models/inception
root@sys-87548:/home/inception# mkdir -p $FLOWERS_DIR/data
root@sys-87548:/home/inception# cd models/inception/
root@sys-87548:/home/inception/models/inception# . /opt/DL/bazel/bin/bazel-activate
root@sys-87548:/home/inception/models/inception# . /opt/DL/tensorflow/bin/tensorflow-activate

root@sys-87548:/home/inception/models/inception# export TEST_TMPDIR=/home/inception/.cache

root@sys-87548:/home/inception/models/inception# bazel build inception/download_and_preprocess_flowers
INFO: $TEST_TMPDIR defined: output root default is '/home/inception/.cache'.
Extracting Bazel installation...
..............
INFO: Found 1 target...
Target //inception:download_and_preprocess_flowers up-to-date:
  bazel-bin/inception/download_and_preprocess_flowers
INFO: Elapsed time: 5.831s, Critical Path: 0.02s

root@sys-87548:/home/inception/models/inception# ls -l
total 76
lrwxrwxrwx 1 root root   116 Jun  8 02:36 bazel-bin -> /home/inception/.cache/_bazel_root/69ffd0b4da93db0b8142429400cccda5/execroot/inception/bazel-out/local-fastbuild/bin
lrwxrwxrwx 1 root root   121 Jun  8 02:36 bazel-genfiles -> /home/inception/.cache/_bazel_root/69ffd0b4da93db0b8142429400cccda5/execroot/inception/bazel-out/local-fastbuild/genfiles
lrwxrwxrwx 1 root root    86 Jun  8 02:36 bazel-inception -> /home/inception/.cache/_bazel_root/69ffd0b4da93db0b8142429400cccda5/execroot/inception
lrwxrwxrwx 1 root root    96 Jun  8 02:36 bazel-out -> /home/inception/.cache/_bazel_root/69ffd0b4da93db0b8142429400cccda5/execroot/inception/bazel-out
lrwxrwxrwx 1 root root   121 Jun  8 02:36 bazel-testlogs -> /home/inception/.cache/_bazel_root/69ffd0b4da93db0b8142429400cccda5/execroot/inception/bazel-out/local-fastbuild/testlogs
drwxr-xr-x 2 root root  4096 Jun  8 02:32 data
drwxr-xr-x 2 root root  4096 Jun  8 02:29 g3doc
drwxr-xr-x 4 root root  4096 Jun  8 02:29 inception
-rw-r--r-- 1 root root 38480 Jun  8 02:29 README.md
-rw-r--r-- 1 root root    30 Jun  8 02:29 WORKSPACE


root@sys-87548:/home/inception/models/inception# bazel-bin/inception/download_and_preprocess_flowers $FLOWERS_DIR/data
Downloading flower data set.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  218M  100  218M    0     0  4649k      0  0:00:48  0:00:48 --:--:-- 5105k
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
...
Found 3170 JPEG files across 5 labels inside /home/u0017496/inception/models/inception/data/raw-data/train.
Launching 2 threads for spacings: [[0, 1585], [1585, 3170]]
2017-06-08 02:01:56.169564 [thread 1]: Processed 1000 of 1585 images in thread batch.
2017-06-08 02:01:56.268917 [thread 0]: Processed 1000 of 1585 images in thread batch.
2017-06-08 02:02:01.252583 [thread 1]: Wrote 1585 images to /home/u0017496/inception/models/inception/data/train-00001-of-00002
2017-06-08 02:02:01.252638 [thread 1]: Wrote 1585 images to 1585 shards.
2017-06-08 02:02:01.306138 [thread 0]: Wrote 1585 images to /home/u0017496/inception/models/inception/data/train-00000-of-00002
2017-06-08 02:02:01.306178 [thread 0]: Wrote 1585 images to 1585 shards.
2017-06-08 02:02:01.578737: Finished writing all 3170 images in data set.

inception v3는 다음과 같이 꽃 사진을 종류별로 인식하는 model입니다.

root@sys-87548:/home/inception/models/inception# du -sm data/raw-data/train/*
29      data/raw-data/train/daisy
43      data/raw-data/train/dandelion
1       data/raw-data/train/LICENSE.txt
34      data/raw-data/train/roses
47      data/raw-data/train/sunflowers
48      data/raw-data/train/tulips

이제 flowers_train을 bazel로 build 합니다.

root@sys-87548:/home/inception/models/inception# bazel build inception/flowers_train
INFO: Found 1 target...
Target //inception:flowers_train up-to-date:
  bazel-bin/inception/flowers_train
INFO: Elapsed time: 0.311s, Critical Path: 0.03s

이제 준비가 끝났습니다.  Docker에서 수행하기 전에, 이 flowers_train을 그냥 그대로 수행해 봅니다.

root@sys-87548:/home/inception/models/inception# time bazel-bin/inception/flowers_train --train_dir=$FLOWERS_DIR/train --data_dir=$FLOWERS_DIR/data --pretrained_model_checkpoint_path=$INCEPTION_DIR/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8

이 서버에는 cuda가 설치되어 있지만 GPU는 없으므로, 아래와 같이 error message가 보이는 것이 당연합니다.  GPU가 없으면 그냥 CPU에서 수행됩니다.  약 20분 이상 수행되므로 도중에 끊겠습니다.

I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
NVIDIA: no NVIDIA devices found
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_UNKNOWN
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (sys-87548): /proc/driver/nvidia/version does not exist
W tensorflow/compiler/xla/service/platform_util.cc:61] platform CUDA present but no visible devices found
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 2 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform Host. Devices:
I tensorflow/compiler/xla/service/service.cc:187]   StreamExecutor device (0): <undefined>, <undefined>
2017-06-08 02:41:53.587744: Pre-trained model restored from /home/inception/inception-v3/model.ckpt-157585
2017-06-08 02:44:28.213350: step 0, loss = 2.85 (0.2 examples/sec; 38.569 sec/batch)
...


이제 앞서 만들어 두었던 bsyu/tensor_r1.0:ppc64le-xenial라는 이름의 docker image를 이용하여 inception v3를 수행하겠습니다.  실제 flowers_train은 /home/inception 밑에 들어있으므로, 이 directory를 -v option을 이용하여 docker image에서도 mount 하도록 합니다.

root@sys-87548:/home/inception/models/inception# docker run  --rm -v /home/inception:/home/inception bsyu/tensor_r1.0:ppc64le-xenial /home/inception/models/inception/bazel-bin/inception/flowers_train --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8

I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
...
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (b85c9a819a6a): /proc/driver/nvidia/version does not exist
W tensorflow/compiler/xla/service/platform_util.cc:61] platform CUDA present but no visible devices found
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 2 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform Host. Devices:
I tensorflow/compiler/xla/service/service.cc:187]   StreamExecutor device (0): <undefined>, <undefined>
2017-06-08 06:48:27.996200: Pre-trained model restored from /home/inception/inception-v3/model.ckpt-157585
2017-06-08 06:51:10.935895: step 0, loss = 2.83 (0.2 examples/sec; 39.389 sec/batch)
2017-06-08 06:56:21.408996: step 10, loss = 2.55 (0.4 examples/sec; 19.373 sec/batch)
2017-06-08 06:59:29.431547: step 20, loss = 2.33 (0.4 examples/sec; 19.856 sec/batch)
2017-06-08 07:02:36.828205: step 30, loss = 2.33 (0.4 examples/sec; 19.014 sec/batch)
2017-06-08 07:05:46.372759: step 40, loss = 2.17 (0.4 examples/sec; 18.428 sec/batch)

잘 수행되는 것을 보실 수 있습니다.   수행되는 중간에 Parent OS에서 nmon으로 관측해보면 python이 CPU 대부분을 사용하는 것을 보실 수 있습니다.  이 python process의 parent PID는 물론 docker daemon입니다.


root@sys-87548:/home/u0017496# ps -ef | grep 14190 | grep -v grep
root     14190 14173 78 02:46 ?        00:00:53 /usr/bin/python /home/inception/models/inception/bazel-bin/inception/flowers_train.runfiles/inception/inception/flowers_train.py --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8

root@sys-87548:/home/u0017496# ps -ef | grep 14173 | grep -v grep
root     14173 15050  0 02:46 ?        00:00:00 docker-containerd-shim b85c9a819a6a497466ea5036a16abc036f0a26809be678224b59ad1b31646178 /var/run/docker/libcontainerd/b85c9a819a6a497466ea5036a16abc036f0a26809be678224b59ad1b31646178 docker-runc
root     14190 14173 80 02:46 ?        00:01:06 /usr/bin/python /home/inception/models/inception/bazel-bin/inception/flowers_train.runfiles/inception/inception/flowers_train.py --train_dir=/home/inception/models/inception/train --data_dir=/home/inception/models/inception/data --pretrained_model_checkpoint_path=/home/inception/inception-v3/model.ckpt-157585 --fine_tune=True --initial_learning_rate=0.001 -input_queue_memory_factor=1 --max_steps=50 --num_gpus 1 --batch_size=8


이제 다음 posting에서는 이 서버와 다른 서버에서 LSF를 통해 이 docker image로 inception v3를 training하는 것을 보시겠습니다.  이 서버인 sys-87548 외에, sys-87549에도 docker를 설치하고 docker image를 pull 해두고, 또 여기서 build된 /home/inception directory를 scp를 통해 sys-87549 서버에도 복사해 둡니다.

root@sys-87549:/home/u0017496# docker pull bsyu/tensor_r1.0:ppc64le-xenial

root@sys-87549:/home/u0017496# scp -r sys-87548:/home/inception /home



댓글 없음:

댓글 쓰기