Deep Learning for HW Engineers: Setting Up an LSF Environment for GPU-Based Caffe Training

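The main reason to use LSF in a GPU environment is that it makes it convenient for several deep learning researchers to share expensive GPU resources. Suppose I need to launch a training job that uses 2 GPUs, but other researchers are already using 3 of the server's 4 GPUs. I then have to wait until their jobs finish, and there is no telling when that will be. It would be nice to simply start the job and go home, or turn to other research, but launching a job blindly like that either fails with an error or, worse, ruins a training job that another researcher has been painstakingly running, so that is not an option either.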

This is where IBM Spectrum LSF comes in. This post walks through how to configure LSF for GPUs, using Caffe as the example.

The test system is a single IBM POWER8 GPU server, the S822LC (commonly known by its code name, Firestone), with 2 NVIDIA K80 cards installed (4 GK210 GPUs in total). The OS is, of course, Ubuntu 16.04 LTS for ppc64le.

First, install Spectrum LSF HPC Suite as follows. Strictly speaking, we do not install the whole suite, only the LSF component inside it. For reference, the HPC Suite bundles LSF together with LS (License Scheduler), PAC (Platform Application Center), PPM (Platform Process Manager), SMPI (Spectrum MPI), and so on. None of those are needed here; LSF alone is enough.

Using the LSF in this HPC Suite requires the Standard Edition entitlement file, named lsf_std_entitlement.dat, which is provided separately when you formally purchase a license. Besides the licensed version there is also a free Community Edition. It is installed and used the same way as the Standard Edition, except that some features are restricted.

root@ubuntu02:/home/test# tar -zxvf lsfshpc10.1.1-ppc64le.tar.gz

test@ubuntu02:~/lsfshpc10.1.1-ppc64le$ ls
ls lsf pac ppm smpi

root@ubuntu02:/home/test# cd lsfshpc10.1.1-ppc64le/lsf/

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf# ls
lsf10.1_lnx310-lib217-ppc64le.tar.Z lsf10.1_lsfinstall_linux_ppc64le.tar.Z

As shown above, the lsf directory contains two .Z compressed files. Extract only the one with lsfinstall in its name. Do not extract the one with lib in its name.

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf# zcat lsf10.1_lsfinstall_linux_ppc64le.tar.Z | tar xvf -

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf# cd lsf10.1_lsfinstall

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# ls
conf_tmpl instlib lsf_unix_install.pdf pversions rpm
hostsetup lap patchinstall README scripts
install.config lsfinstall patchlib rhostsetup slave.config

Now edit install.config. The parameter names are all self-explanatory. LSF_MASTER_LIST takes a space-separated list of server names: the first server in the list is the active master, and the ones after it become secondary masters. There is exactly one server here (ubuntu02), acting as both master and slave, so only that one name is listed.
LSF_ADD_SERVERS lists the slave servers that will actually run jobs, again as space-separated server names. Here it contains only ubuntu02.
LSF_TARDIR is the directory that holds lsf10.1_lnx310-lib217-ppc64le.tar.Z, the file I told you above not to extract.

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# vi install.config
LSF_TOP="/usr/share/lsf"
LSF_ADMINS="test"
LSF_CLUSTER_NAME="firestone"
LSF_MASTER_LIST="ubuntu02"
LSF_TARDIR="/home/test/lsfshpc10.1.1-ppc64le/lsf"
# CONFIGURATION_TEMPLATE="DEFAULT|PARALLEL|HIGH_THROUGHPUT"
LSF_ADD_SERVERS="ubuntu02"
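Before running the installer, it can be worth confirming that none of the parameters edited above was left commented out. The following is a minimal sketch, not part of LSF: the function name missing_install_params is hypothetical, and the list of "required" names is simply my assumption based on the parameters used in this post.

```shell
# Print every install.config parameter from this walkthrough that is
# missing, empty, or still commented out in the given file.
# NOTE: the list of names below is an assumption taken from this post,
# not an official list of mandatory lsfinstall parameters.
missing_install_params() {
    local config_file="$1" name
    for name in LSF_TOP LSF_ADMINS LSF_CLUSTER_NAME LSF_MASTER_LIST LSF_TARDIR; do
        # Match only active NAME="value" lines with a non-empty value.
        grep -Eq "^[[:space:]]*${name}=\"[^\"]+\"" "$config_file" || echo "$name"
    done
}

# Usage: missing_install_params install.config
# (prints nothing when all five parameters are set)
```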

When the edits are done, run the lsfinstall command with that config file as follows.

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# ./lsfinstall -f install.config

Then copy the Standard Edition entitlement file mentioned above, which you obtained beforehand, into place as follows.

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# cp /home/test/lsf_std_entitlement.dat /usr/share/lsf/conf/lsf.entitlement

Once that is done, run the hostsetup command, which normally has to be run on the slave servers. (Again, ubuntu02 is both master and slave here.) With the --boot="y" option, the LSF daemons start automatically at every boot.

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# ./hostsetup --top="/usr/share/lsf" --boot="y"

Next, add an entry to .bashrc (or similar) so that /usr/share/lsf/conf/profile.lsf is always sourced, as shown below. Put the same entry in .bashrc for the test user, registered above as the LSF admin, and not just for root.

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# vi /root/.bashrc
. /usr/share/lsf/conf/profile.lsf

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# . /root/.bashrc
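Since the same entry has to go into more than one .bashrc, a small idempotent helper avoids adding it twice. This is only a sketch; ensure_line is a hypothetical helper name, and /home/test/.bashrc is assumed to be the test user's file.

```shell
# Append LINE to FILE only if that exact line is not already present.
ensure_line() {
    local line="$1" file="$2"
    touch "$file"
    # -x: match the whole line, -F: treat the text as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (as root):
#   ensure_line '. /usr/share/lsf/conf/profile.lsf' /root/.bashrc
#   ensure_line '. /usr/share/lsf/conf/profile.lsf' /home/test/.bashrc
```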

Also register the test user in the file below so that this user can run the LSF startup commands with root privileges. Note that /etc/lsf.sudoers must have permission 600 and be owned by root:root.

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# sudo vi /etc/lsf.sudoers
LSB_PRE_POST_EXEC_USER=test
LSF_STARTUP_PATH=/usr/share/lsf/10.1/linux3.10-glibc2.17-ppc64le/etc
LSF_STARTUP_USERS="test"

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# ls -l /etc/lsf.sudoers
-rw------- 1 root root 126 Jun 29 17:38 /etc/lsf.sudoers
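The permission and ownership requirement can also be checked in a script rather than by eye. A minimal sketch, assuming GNU stat (the -c option) as shipped with Ubuntu; check_mode_owner is a hypothetical helper name.

```shell
# Succeed only if FILE has the expected octal mode and owner:group.
# Relies on GNU stat's -c format option.
check_mode_owner() {
    local file="$1" want_mode="$2" want_owner="$3" actual
    actual=$(stat -c '%a %U:%G' "$file") || return 1
    if [ "$actual" = "$want_mode $want_owner" ]; then
        return 0
    fi
    echo "$file is '$actual', expected '$want_mode $want_owner'" >&2
    return 1
}

# Usage: check_mode_owner /etc/lsf.sudoers 600 root:root
```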

Now start the LSF daemons. There are several of them, but there is no need to handle each one separately: start them with lsfstartup and stop them with lsfshutdown. From the master you can bring the daemons up or down on every host in the cluster at once. This of course requires that ssh keys be copied in advance so that ssh works without a password prompt. Here a single server plays both master and slave, but passwordless ssh must still be set up from the host to itself. (That step is omitted here. See https://hwengineer.blogspot.kr/2017/06/power8-lsf-tensorflow-docker-image.html for details.)

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# which lsfstartup
/usr/share/lsf/10.1/linux3.10-glibc2.17-ppc64le/bin/lsfstartup

root@ubuntu02:/home/test/lsfshpc10.1.1-ppc64le/lsf/lsf10.1_lsfinstall# lsfstartup
Starting up all LIMs ...
Do you really want to start up LIM on all hosts ? [y/n]y
Start up LIM on <ubuntu02> ...... done

Waiting for Master LIM to start up ... Master LIM is ok
Starting up all RESes ...
Do you really want to start up RES on all hosts ? [y/n]y
Start up RES on <ubuntu02> ...... done

Starting all slave daemons on LSBATCH hosts ...
Do you really want to start up slave batch daemon on all hosts ? [y/n] y
Start up slave batch daemon on <ubuntu02> ...... done

Done starting up LSF daemons on the local LSF cluster ...

The LSF cluster is now configured. However, if you submit GPU-based Caffe jobs as things stand, you will see them fail with errors. Let's look carefully at why, and at how to fix it.

First, let's prepare to train the CIFAR-10 model with Caffe.

For convenience, configure the test user to run sudo without a password prompt, as shown below.

test@ubuntu02:/opt/DL/caffe-nv$ sudo vi /etc/sudoers
...
test ALL=(ALL) NOPASSWD: ALL

Now prepare the CIFAR-10 data and scripts using NVIDIA Caffe (caffe-nv), which is included in the PowerAI toolkit. This is easy.

test@ubuntu02:/opt/DL/caffe-nv$ sudo ./data/cifar10/get_cifar10.sh
Downloading...
--2017-07-04 10:37:53-- http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
Resolving www.cs.toronto.edu (www.cs.toronto.edu)... 128.100.3.30
Connecting to www.cs.toronto.edu (www.cs.toronto.edu)|128.100.3.30|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 170052171 (162M) [application/x-gzip]
Saving to: ‘cifar-10-binary.tar.gz’

cifar-10-binary.tar.gz 100%[================================>] 162.17M 4.10MB/s in 25s

2017-07-04 10:38:19 (6.59 MB/s) - ‘cifar-10-binary.tar.gz’ saved [170052171/170052171]

Unzipping...
Done.

Some of the scripts contain wrong paths, so fix them as follows.

test@ubuntu02:/opt/DL/caffe-nv$ sudo vi ./examples/cifar10/create_cifar10.sh
...
if [ -z "$CAFFE_BIN" ]; then
# EXAMPLES=./build/$EXAMPLE
EXAMPLES=./bin
# TOOLS=./build/tools
TOOLS=./bin
else
...
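The same path fix can be applied non-interactively with sed instead of vi. This is a sketch under the assumption that the lines to change look exactly like the commented-out originals shown above (EXAMPLES=./build/... and TOOLS=./build/tools); fix_caffe_paths is a hypothetical helper name. Unlike the manual edit, it replaces the old lines rather than keeping them as comments.

```shell
# Rewrite the ./build/... paths of a Caffe example script to ./bin.
# Reads the script on stdin and writes the fixed script to stdout,
# preserving each line's leading indentation.
fix_caffe_paths() {
    sed -e 's|^\([[:space:]]*\)EXAMPLES=\./build/.*$|\1EXAMPLES=./bin|' \
        -e 's|^\([[:space:]]*\)TOOLS=\./build/tools$|\1TOOLS=./bin|'
}

# Usage (keep a backup, then overwrite the original):
#   sudo cp examples/cifar10/create_cifar10.sh /tmp/create_cifar10.sh.orig
#   fix_caffe_paths < /tmp/create_cifar10.sh.orig | sudo tee examples/cifar10/create_cifar10.sh > /dev/null
```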

Now create the CIFAR-10 LMDB as follows.

test@ubuntu02:/opt/DL/caffe-nv$ sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib ./examples/cifar10/create_cifar10.sh
Creating lmdb...
I0704 10:58:13.721052 84760 db_lmdb.cpp:35] Opened lmdb examples/cifar10/cifar10_train_lmdb
I0704 10:58:13.721252 84760 convert_cifar_data.cpp:52] Writing Training data
I0704 10:58:13.721264 84760 convert_cifar_data.cpp:55] Training Batch 1
I0704 10:58:13.764257 84760 convert_cifar_data.cpp:55] Training Batch 2
I0704 10:58:13.801908 84760 convert_cifar_data.cpp:55] Training Batch 3
I0704 10:58:13.830626 84760 convert_cifar_data.cpp:55] Training Batch 4
I0704 10:58:13.877624 84760 convert_cifar_data.cpp:55] Training Batch 5
I0704 10:58:18.264618 84760 convert_cifar_data.cpp:73] Writing Testing data
I0704 10:58:18.264998 84760 db_lmdb.cpp:35] Opened lmdb examples/cifar10/cifar10_test_lmdb
Computing image mean...
Done.

Everything is now ready for CIFAR-10 training. Running the stock train_quick.sh as-is trains fine on a single GPU, as shown below. (Here too, the script has to be edited to use the bin directory instead of build, as below.)

test@ubuntu02:/opt/DL/caffe-nv$ sudo vi ./examples/cifar10/train_quick.sh
...
if [ -z "$CAFFE_BIN" ]; then
# TOOLS=./build/tools
TOOLS=./bin
else
TOOLS=$CAFFE_BIN
fi

$TOOLS/caffe train \
--solver=examples/cifar10/cifar10_quick_solver.prototxt

test@ubuntu02:/opt/DL/caffe-nv$ sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_quick.sh
...
I0704 11:48:33.841511 87435 sgd_solver.cpp:106] Iteration 4800, lr = 0.0001
I0704 11:48:35.339391 87435 solver.cpp:242] Iteration 4900 (66.76 iter/s, 1.4979s/100 iter), loss = 0.38117
I0704 11:48:35.339428 87435 solver.cpp:261] Train net output #0: loss = 0.38117 (* 1 = 0.38117 loss)
I0704 11:48:35.339442 87435 sgd_solver.cpp:106] Iteration 4900, lr = 0.0001
I0704 11:48:36.822921 87435 solver.cpp:489] Snapshotting to HDF5 file examples/cifar10/cifar10_quick_iter_5000.caffemodel.h5
I0704 11:48:36.824291 87435 sgd_solver.cpp:283] Snapshotting solver state to HDF5 file examples/cifar10/cifar10_quick_iter_5000.solverstate.h5
I0704 11:48:36.829028 87435 solver.cpp:342] Iteration 5000, loss = 0.456113
I0704 11:48:36.829043 87435 solver.cpp:362] Iteration 5000, Testing net (#0)
I0704 11:48:37.224135 87435 solver.cpp:429] Test net output #0: accuracy = 0.7594
I0704 11:48:37.224155 87435 solver.cpp:429] Test net output #1: loss = 0.734521 (* 1 = 0.734521 loss)
I0704 11:48:37.224179 87435 solver.cpp:347] Optimization Done.
I0704 11:48:37.224186 87435 caffe.cpp:234] Optimization Done.
138.60user 30.86system 1:23.55elapsed 202%CPU (0avgtext+0avgdata 678784maxresident)k
16inputs+6016outputs (0major+24918minor)pagefaults 0swaps

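As the 'nvidia-smi -l 5' monitoring output below shows, this training run uses one GPU. By default, Caffe always sends the job to the first GPU. (Here it treats the GPU that nvidia-smi lists as GPU 2 as the first one, presumably because the CUDA runtime's default device ordering differs from the PCI bus order that nvidia-smi uses.)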

Tue Jul 4 11:48:09 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.119 Driver Version: 361.119 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0002:03:00.0 Off | 0 |
| N/A 39C P8 26W / 149W | 2MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0002:04:00.0 Off | 0 |
| N/A 36C P8 30W / 149W | 2MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0004:03:00.0 Off | 0 |
| N/A 57C P0 129W / 149W | 216MiB / 11441MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0004:04:00.0 Off | 0 |
| N/A 38C P8 29W / 149W | 2MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 87308 C ./bin/caffe 214MiB |
+-----------------------------------------------------------------------------+

This time, let's train on 2 GPUs at a time. Edit the train_quick.sh script so that the caffe command uses the -gpu option.

test@ubuntu02:/opt/DL/caffe-nv$ sudo vi ./examples/cifar10/train_quick.sh
if [ -z "$CAFFE_BIN" ]; then
# TOOLS=./build/tools
TOOLS=./bin
else
TOOLS=$CAFFE_BIN
fi

$TOOLS/caffe train -gpu 0,1 \
--solver=examples/cifar10/cifar10_quick_solver.prototxt

# reduce learning rate by factor of 10 after 8 epochs
$TOOLS/caffe train -gpu 0,1 \
--solver=examples/cifar10/cifar10_quick_solver_lr1.prototxt \
--snapshot=examples/cifar10/cifar10_quick_iter_4000.solverstate.h5
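Rather than hard-coding the GPU list in the script, one option is to take it from an environment variable, so the script itself does not need editing for each run. The sketch below is my own illustration, not part of Caffe or of the original script: GPU_LIST and caffe_train_cmd are hypothetical names. Keep in mind that sudo normally resets the environment, so the variable would have to be passed explicitly on the sudo command line, the same way LD_LIBRARY_PATH is passed in this post.

```shell
# Build the caffe training command line, taking the GPU list from
# $GPU_LIST (hypothetical variable) and falling back to GPU 0 when it
# is unset or empty. $TOOLS defaults to ./bin as in the edited script.
caffe_train_cmd() {
    local tools="${TOOLS:-./bin}" solver="$1"
    echo "$tools/caffe train -gpu ${GPU_LIST:-0} --solver=$solver"
}

# Usage inside train_quick.sh:
#   $(caffe_train_cmd examples/cifar10/cifar10_quick_solver.prototxt)
# and on the command line:
#   sudo GPU_LIST=2,3 LD_LIBRARY_PATH=... ./examples/cifar10/train_quick.sh
```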

Running it now shows 2 GPUs in use. It picks GPUs 2 and 3.

test@ubuntu02:/opt/DL/caffe-nv$ sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_quick.sh
...
I0704 11:51:57.520256 87780 solver.cpp:242] Iteration 4800 (94.5059 iter/s, 1.05814s/100 iter), loss = 0.429975
I0704 11:51:57.520298 87780 solver.cpp:261] Train net output #0: loss = 0.429975 (* 1 = 0.429975 loss)
I0704 11:51:57.520318 87780 sgd_solver.cpp:106] Iteration 4800, lr = 0.0001
I0704 11:51:58.578877 87780 solver.cpp:242] Iteration 4900 (94.4687 iter/s, 1.05855s/100 iter), loss = 0.631555
I0704 11:51:58.578930 87780 solver.cpp:261] Train net output #0: loss = 0.631555 (* 1 = 0.631555 loss)
I0704 11:51:58.578975 87780 sgd_solver.cpp:106] Iteration 4900, lr = 0.0001
I0704 11:51:59.628901 87780 solver.cpp:489] Snapshotting to HDF5 file examples/cifar10/cifar10_quick_iter_5000.caffemodel.h5
I0704 11:51:59.630488 87780 sgd_solver.cpp:283] Snapshotting solver state to HDF5 file examples/cifar10/cifar10_quick_iter_5000.solverstate.h5
I0704 11:51:59.633839 87780 solver.cpp:342] Iteration 5000, loss = 0.444928
I0704 11:51:59.633874 87780 solver.cpp:362] Iteration 5000, Testing net (#0)
I0704 11:52:00.025651 87780 solver.cpp:429] Test net output #0: accuracy = 0.7373
I0704 11:52:00.025693 87780 solver.cpp:429] Test net output #1: loss = 0.784022 (* 1 = 0.784022 loss)
I0704 11:52:00.025703 87780 solver.cpp:347] Optimization Done.
I0704 11:52:00.041434 87780 caffe.cpp:234] Optimization Done.
162.54user 28.23system 1:02.22elapsed 306%CPU (0avgtext+0avgdata 997696maxresident)k
0inputs+6016outputs (0major+36442minor)pagefaults 0swaps

Tue Jul 4 11:51:21 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.119 Driver Version: 361.119 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0002:03:00.0 Off | 0 |
| N/A 39C P8 26W / 149W | 2MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0002:04:00.0 Off | 0 |
| N/A 36C P8 30W / 149W | 2MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0004:03:00.0 Off | 0 |
| N/A 50C P0 113W / 149W | 191MiB / 11441MiB | 92% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0004:04:00.0 Off | 0 |
| N/A 49C P0 128W / 149W | 152MiB / 11441MiB | 92% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 87633 C ./bin/caffe 189MiB |
| 3 87633 C ./bin/caffe 150MiB |
+-----------------------------------------------------------------------------+

Next, let's run this script twice, back to back. Since the script specifies -gpu 0,1, both jobs will try to use the same 2 GPUs. What happens then? As seen above, each job uses only about 150 to 190 MiB of memory per GPU, so two jobs running at once look like they should be no problem.

Wed Jul 5 10:42:55 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.119 Driver Version: 361.119 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0002:03:00.0 Off | 0 |
| N/A 39C P8 27W / 149W | 2MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0002:04:00.0 Off | 0 |
| N/A 36C P8 30W / 149W | 2MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0004:03:00.0 Off | 0 |
| N/A 44C P0 75W / 149W | 383MiB / 11441MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0004:04:00.0 Off | 0 |
| N/A 44C P0 85W / 149W | 304MiB / 11441MiB | 99% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 7906 C ./bin/caffe 189MiB |
| 2 8023 C ./bin/caffe 189MiB |
| 3 7906 C ./bin/caffe 150MiB |
| 3 8023 C ./bin/caffe 150MiB |
+-----------------------------------------------------------------------------+

The bottom line is that both jobs do start, but rather than each taking twice as long, they hang outright in the state shown above. In other words, the two jobs lock each other up.

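To avoid this, you have to check which GPUs are currently idle and then edit the script that runs caffe so that it names the idle GPUs. Here, the second job has to specify -gpu 2,3 rather than -gpu 0,1. Done that way, both run fine, as shown below.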

$TOOLS/caffe train -gpu 0,1 --solver=examples/cifar10/cifar10_quick_solver.prototxt

$TOOLS/caffe train -gpu 2,3 --solver=examples/cifar10/cifar10_quick_solver.prototxt

Wed Jul 5 11:09:03 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.119 Driver Version: 361.119 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0002:03:00.0 Off | 0 |
| N/A 46C P0 111W / 149W | 191MiB / 11441MiB | 90% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0002:04:00.0 Off | 0 |
| N/A 44C P0 124W / 149W | 152MiB / 11441MiB | 87% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0004:03:00.0 Off | 0 |
| N/A 48C P0 110W / 149W | 191MiB / 11441MiB | 92% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0004:04:00.0 Off | 0 |
| N/A 47C P0 127W / 149W | 152MiB / 11441MiB | 92% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 5343 C ./bin/caffe 189MiB |
| 1 5343 C ./bin/caffe 150MiB |
| 2 5221 C ./bin/caffe 189MiB |
| 3 5221 C ./bin/caffe 150MiB |
+-----------------------------------------------------------------------------+
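The manual check for idle GPUs can be scripted with nvidia-smi's CSV query output. The sketch below is hedged in two ways. First, idle_gpus is a hypothetical helper, and "idle" is simply taken to mean utilization at or below a threshold at the instant of the query, which can change before the job starts. Second, the indexes printed are nvidia-smi indexes, which, as this post shows, do not necessarily match the device numbers Caffe's -gpu option uses (setting CUDA_DEVICE_ORDER=PCI_BUS_ID is the usual way to make the CUDA numbering follow the PCI bus order).

```shell
# Read "index, utilization" CSV rows on stdin and print a comma-separated
# list of GPU indexes whose utilization is at or below the threshold
# (first argument, default 0 percent). Prints an empty line if none qualify.
idle_gpus() {
    awk -F', *' -v max="${1:-0}" '
        $2 + 0 <= max + 0 { list = list (list == "" ? "" : ",") $1 }
        END { print list }'
}

# Usage on the GPU server:
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits | idle_gpus
```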

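But monitoring the GPUs and editing the run script by hand every time is obviously inconvenient. The whole point of using LSF is to write a run script once, submit it, and let the scheduler handle placement without any such monitoring, so having to edit the script for every run defeats the purpose.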

Caffe has a further problem when the -gpu option is omitted: the job is then always assigned to the first GPU. So if the -gpu option is not used, concurrent Caffe jobs are bound to fail, whether you launch them by hand or through LSF.

Solving this takes the following steps. First, the GPU compute mode has to be changed from default (shared) mode to exclusive mode.

test@ubuntu02:~$ nvidia-smi -q | grep -i compute
Compute Mode : Default
Compute Mode : Default
Compute Mode : Default
Compute Mode : Default

According to the documentation, compute mode 1 is EXCLUSIVE_THREAD, but with CUDA 8.0 nvidia-smi reports that this mode is deprecated and simply sets EXCLUSIVE_PROCESS (3) instead.

test@ubuntu02:~$ sudo nvidia-smi -c 1
Warning: Exclusive_Thread was deprecated! Setting Exclusive_Process instead.
Set compute mode to EXCLUSIVE_PROCESS for GPU 0002:03:00.0.
Warning: Exclusive_Thread was deprecated! Setting Exclusive_Process instead.
Set compute mode to EXCLUSIVE_PROCESS for GPU 0002:04:00.0.
Warning: Exclusive_Thread was deprecated! Setting Exclusive_Process instead.
Set compute mode to EXCLUSIVE_PROCESS for GPU 0004:03:00.0.
Warning: Exclusive_Thread was deprecated! Setting Exclusive_Process instead.
Set compute mode to EXCLUSIVE_PROCESS for GPU 0004:04:00.0.
All done.

For reference, compute mode 2 is PROHIBITED, which blocks compute work altogether.

test@ubuntu02:~$ sudo nvidia-smi -c 2
Set compute mode to PROHIBITED for GPU 0002:03:00.0.
Set compute mode to PROHIBITED for GPU 0002:04:00.0.
Set compute mode to PROHIBITED for GPU 0004:03:00.0.
Set compute mode to PROHIBITED for GPU 0004:04:00.0.
All done.

In practice, just choose mode 3. Choosing mode 1 ends up as EXCLUSIVE_PROCESS anyway. The setting is lost on reboot, so to make it permanent, register the command in /etc/rc.local or similar.

test@ubuntu02:~$ sudo nvidia-smi -c 3
Set compute mode to EXCLUSIVE_PROCESS for GPU 0002:03:00.0.
Set compute mode to EXCLUSIVE_PROCESS for GPU 0002:04:00.0.
Set compute mode to EXCLUSIVE_PROCESS for GPU 0004:03:00.0.
Set compute mode to EXCLUSIVE_PROCESS for GPU 0004:04:00.0.
All done.
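Because the mode is lost on reboot, it helps to be able to verify it from a script, for example right after the command placed in /etc/rc.local has run. This is a minimal sketch: all_exclusive_process is a hypothetical helper, and it assumes that 'nvidia-smi -q' reports the mode with the literal text Exclusive_Process, the spelling used in the warning messages above.

```shell
# Read `nvidia-smi -q` output on stdin. Succeed only if at least one
# "Compute Mode" line is present and every such line reports
# Exclusive_Process (compared case-insensitively).
all_exclusive_process() {
    awk -F': *' '
        /Compute Mode/ { total++; if (tolower($2) ~ /^exclusive_process/) ok++ }
        END { exit !(total > 0 && ok == total) }'
}

# Usage, e.g. after "nvidia-smi -c 3" in /etc/rc.local:
#   nvidia-smi -q | all_exclusive_process || echo "GPUs are not all in exclusive process mode" >&2
```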

Now, if training is running on gpu 0 and a second training run tries to use the same gpu 0, the job started later fails with the error shown below. The job that was already running is unaffected and continues normally.

test@ubuntu02:/opt/DL/caffe-nv$ sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_quick.sh
F0705 11:32:12.339743 7616 gpu_memory.cpp:168] Check failed: error == cudaSuccess (46 vs. 0) all CUDA-capable devices are busy or unavailable
*** Check failure stack trace: ***
@ 0x3fff9f28ce0c google::LogMessage::Fail()
@ 0x3fff9f28f284 google::LogMessage::SendToLog()
@ 0x3fff9f28c768 google::LogMessage::Flush()
@ 0x3fff9f2911c4 google::LogMessageFatal::~LogMessageFatal()
@ 0x3fff9f5d9c50 caffe::GPUMemory::Manager::update_dev_info()
@ 0x3fff9f5daf74 caffe::GPUMemory::Manager::init()
@ 0x1000b128 (unknown)
@ 0x10007b54 (unknown)
@ 0x3fff9e97309c (unknown)
@ 0x3fff9e973298 __libc_start_main
@ (nil) (unknown)
Aborted (core dumped)
F0705 11:32:12.533152 7620 gpu_memory.cpp:168] Check failed: error == cudaSuccess (46 vs. 0) all CUDA-capable devices are busy or unavailable
*** Check failure stack trace: ***
@ 0x3fff9fb7ce0c google::LogMessage::Fail()
@ 0x3fff9fb7f284 google::LogMessage::SendToLog()
@ 0x3fff9fb7c768 google::LogMessage::Flush()
@ 0x3fff9fb811c4 google::LogMessageFatal::~LogMessageFatal()
@ 0x3fff9fec9c50 caffe::GPUMemory::Manager::update_dev_info()
@ 0x3fff9fecaf74 caffe::GPUMemory::Manager::init()
@ 0x1000b128 (unknown)
@ 0x10007b54 (unknown)
@ 0x3fff9f26309c (unknown)
@ 0x3fff9f263298 __libc_start_main
@ (nil) (unknown)
Aborted (core dumped)
Command exited with non-zero status 134
0.07user 0.08system 0:00.37elapsed 42%CPU (0avgtext+0avgdata 64512maxresident)k
0inputs+1024outputs (0major+3069minor)pagefaults 0swaps
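The exit status 134 reported above follows the usual shell convention of 128 plus the signal number: 134 - 128 = 6, which is SIGABRT and matches the "Aborted (core dumped)" message. A small sketch of that arithmetic, with signal_from_status as a hypothetical helper name:

```shell
# Translate a shell exit status above 128 into the name of the signal
# that terminated the process; print "none" for ordinary exit codes.
signal_from_status() {
    local status="$1"
    if [ "$status" -gt 128 ]; then
        kill -l "$((status - 128))"
    else
        echo "none"
    fi
}

# Usage: signal_from_status 134
```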

Now let's submit the Caffe job through LSF. No special options are needed; just prefix the command with bsub.

test@ubuntu02:/opt/DL/caffe-nv$ bsub sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_quick.sh
Job <220> is submitted to default queue <normal>.

test@ubuntu02:/opt/DL/caffe-nv$ bsub sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_quick.sh
Job <221> is submitted to default queue <normal>.

Submission succeeded, but we still need to see whether the jobs actually run. The bhist command shows this. As you would expect, the first job submitted completed successfully.

test@ubuntu02:/opt/DL/caffe-nv$ bhist -l 220

Job <220>, User <test>, Project <default>, Command <sudo LD_LIBRARY_PATH=/opt/D
L/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time
./examples/cifar10/train_quick.sh>
Wed Jul 5 11:34:11: Submitted from host <ubuntu02>, to Queue <normal>, CWD </o
pt/DL/caffe-nv>;
Wed Jul 5 11:34:12: Dispatched 1 Task(s) on Host(s) <ubuntu02>, Allocated 1 Sl
ot(s) on Host(s) <ubuntu02>, Effective RES_REQ <select[typ
e == local] order[r15s:pg] >;
Wed Jul 5 11:34:12: Starting (Pid 7824);
Wed Jul 5 11:34:12: Running with execution home </home/test>, Execution CWD </
opt/DL/caffe-nv>, Execution Pid <7824>;
Wed Jul 5 11:35:43: Done successfully. The CPU time used is 183.6 seconds;
Wed Jul 5 11:35:45: Post job process done successfully;

MEMORY USAGE:
MAX MEM: 271 Mbytes; AVG MEM: 269 Mbytes

Summary of time in seconds spent in various states by Wed Jul 5 11:35:45
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
1 0 91 0 0 0 92

The problem is the second job, which, sure enough, failed. It exits with code 134, the same error as when the job was run by hand.

test@ubuntu02:/opt/DL/caffe-nv$ bhist -l 221

Job <221>, User <test>, Project <default>, Command <sudo LD_LIBRARY_PATH=/opt/D
L/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time
./examples/cifar10/train_quick.sh>
Wed Jul 5 11:35:15: Submitted from host <ubuntu02>, to Queue <normal>, CWD </o
pt/DL/caffe-nv>;
Wed Jul 5 11:35:15: Dispatched 1 Task(s) on Host(s) <ubuntu02>, Allocated 1 Sl
ot(s) on Host(s) <ubuntu02>, Effective RES_REQ <select[typ
e == local] order[r15s:pg] >;
Wed Jul 5 11:35:15: Starting (Pid 7963);
Wed Jul 5 11:35:15: Running with execution home </home/test>, Execution CWD </
opt/DL/caffe-nv>, Execution Pid <7963>;
Wed Jul 5 11:35:18: Exited with exit code 134. The CPU time used is 0.7 second
s;
Wed Jul 5 11:35:18: Completed <exit>;

MEMORY USAGE:
MAX MEM: 37 Mbytes; AVG MEM: 37 Mbytes

Summary of time in seconds spent in various states by Wed Jul 5 11:35:18
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
0 0 3 0 0 0 3

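What causes this error? It is simple once you think about it. LSF does not yet have any mechanism in place for monitoring GPU resources, and I did not specify any GPU requirement when submitting the job. LSF therefore placed the job by looking only at its default requirement, slots (CPU resources), and as a result two Caffe jobs ended up running on the same GPU.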

The first step toward fixing this is to register GPU resources with LSF so that it recognizes and monitors them.

First, let's see which GPU resource items LSF 10.1 provides. To do so, run the elim.gpu command. It lives under /usr/share/lsf/10.1/linux3.10-glibc2.17-ppc64le/etc and does not exit on its own, so stop it with Ctrl-C.

test@ubuntu02:/opt/DL/caffe-nv$ elim.gpu
4 ngpus 4 ngpus_shared 0 ngpus_excl_t 0 ngpus_excl_p 4
^C

The leading 4 means that 4 parameters are reported: the number of GPUs (ngpus) is 4, the number of GPUs in shared mode (ngpus_shared) is 0, the number in exclusive thread mode (ngpus_excl_t) is 0, and finally the number in exclusive process mode (ngpus_excl_p) is 4. The values come out this way because I set the GPU compute mode to 3, EXCLUSIVE_PROCESS, just above.
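To make the report format concrete, here is a small sketch that splits one elim.gpu line into name=value pairs. The helper name parse_elim_line is hypothetical, and the format assumed is only what the output above shows: a count N followed by N name/value pairs.

```shell
# Turn one ELIM report line ("N name1 value1 ... nameN valueN") into
# one name=value pair per output line. Lines whose field count does not
# match the leading N are silently ignored.
parse_elim_line() {
    awk 'NF == 2 * $1 + 1 { for (i = 2; i < NF; i += 2) print $i "=" $(i + 1) }'
}

# Usage (timeout stops elim.gpu, which otherwise never exits):
#   timeout 5 elim.gpu | parse_elim_line
```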

Now register these items with LSF. Go to the LSF conf directory and edit the lsf.shared file. In the existing stanza, between Begin Resource and End Resource, you will see entries such as mips and sparc, and also one named aix.

test@ubuntu02:/usr/share/lsf/10.1/linux3.10-glibc2.17-ppc64le/etc$ cd $LSF_ENVDIR

test@ubuntu02:/usr/share/lsf/conf$ vi lsf.shared

Begin Resource
mips Boolean () () (MIPS architecture)
sparc Boolean () () (SUN SPARC)
hpux Boolean () () (HP-UX UNIX)
aix Boolean () () (AIX UNIX)
irix Boolean () () (IRIX UNIX)
... (omitted) ...
openmpi Boolean () () (OPENMPI)
bluegene Boolean () () (BLUEGENE)
define_ncpus_procs Boolean () () (ncpus := procs)
define_ncpus_cores Boolean () () (ncpus := cores)
define_ncpus_threads Boolean () () (ncpus := threads)
vnode Boolean () () (Simulation node used by integrations for example Cray Linux)
craylinux Boolean () () (Cray Linux Environment: CRAY XT/XE login nodes and compute nodes)
gpu Boolean () () (gpu availability)
End Resource

Leave those entries as they are, and insert a new Begin Resource / End Resource stanza like the following below them.

Begin Resource

RESOURCENAME TYPE INTERVAL INCREASING CONSUMABLE DESCRIPTION # Keywords
ngpus Numeric 60 N N (Number of GPUs)
ngpus_shared Numeric 60 N N (Number of GPUs in Shared Mode)
ngpus_excl_t Numeric 60 N Y (Number of GPUs in Exclusive thread Mode)
ngpuprohibited Numeric 60 N N (Number of GPUs in Prohibited Mode)
ngpus_excl_p Numeric 60 N Y (Number of GPUs in Exclusive process Mode)

End Resource

Next, edit the lsf.cluster.<cluster_name> file as well. My cluster is named firestone. Again, leave the existing entries alone and add the Begin ResourceMap / End ResourceMap section shown below.

test@ubuntu02:/usr/share/lsf/conf$ vi lsf.cluster.firestone
...
Begin ResourceMap
RESOURCENAME LOCATION
ngpus ([default])
ngpus_shared ([default])
ngpus_excl_t ([default])
ngpuprohibited ([default])
ngpus_excl_p ([default])
End ResourceMap
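Every resource name defined in lsf.shared needs a matching line in the ResourceMap, so generating the stanza from one list of names keeps the two files in step. A sketch only: resource_map_stanza is a hypothetical helper, and the column width is an arbitrary choice.

```shell
# Print a ResourceMap stanza that maps each resource name given as an
# argument to ([default]), in the layout used in lsf.cluster.firestone above.
resource_map_stanza() {
    local name
    echo "Begin ResourceMap"
    printf '%-16s %s\n' RESOURCENAME LOCATION
    for name in "$@"; do
        printf '%-16s %s\n' "$name" '([default])'
    done
    echo "End ResourceMap"
}

# Usage:
#   resource_map_stanza ngpus ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
```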

Now run reconfig.

test@ubuntu02:/usr/share/lsf/conf$ lsadmin reconfig

Checking configuration files ...
No errors found.

Restart only the master candidate hosts? [y/n] n
Do you really want to restart LIMs on all hosts? [y/n] y
Restart LIM on <ubuntu02> ...... done

test@ubuntu02:/usr/share/lsf/conf$ badmin reconfig

Checking configuration files ...

No errors found.

Reconfiguration initiated

Now run the bhosts -l command.

test@ubuntu02:~$ bhosts -l
HOST ubuntu02
STATUS CPUF JL/U MAX NJOBS RUN SSUSP USUSP RSV DISPATCH_WINDOW
ok 250.00 - 16 0 0 0 0 0 -

CURRENT LOAD USED FOR SCHEDULING:
r15s r1m r15m ut pg io ls it tmp swp mem slots ngpus
Total 0.0 0.0 0.0 0% 0.0 29 1 0 782G 37.6G 125G 16 4.0
Reserved 0.0 0.0 0.0 0% 0.0 0 0 0 0M 0M 0M - 0.0

ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
Total 0.0 0.0 0.0 4.0
Reserved 0.0 0.0 - 0.0

LOAD THRESHOLD USED FOR SCHEDULING:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -

ngpus ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
loadSched - - - - -
loadStop - - - - -

CONFIGURED AFFINITY CPU LIST: all

You can see that the ngpus_shared, ngpus_excl_t, ngpuprohibited, and ngpus_excl_p items I just registered are now being monitored.

Now let's submit a Caffe job by simply prefixing the command with bsub.

test@ubuntu02:/opt/DL/caffe-nv$ bsub sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_quick.sh
Job <1033> is submitted to default queue <normal>.

As shown below, this job itself runs fine. Confirm that its status is RUN.

test@ubuntu02:/opt/DL/caffe-nv$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1033 test RUN normal ubuntu02 ubuntu02 *_quick.sh Jul 7 10:27

However, bhosts -l still shows ngpus_excl_p with a Total of 4 and a Reserved value of 0, even though nvidia-smi shows that one GPU is busy running Caffe.

test@ubuntu02:/opt/DL/caffe-nv$ bhosts -l
HOST ubuntu02
STATUS CPUF JL/U MAX NJOBS RUN SSUSP USUSP RSV DISPATCH_WINDOW
ok 250.00 - 16 1 1 0 0 0 -

CURRENT LOAD USED FOR SCHEDULING:
r15s r1m r15m ut pg io ls it tmp swp mem slots ngpus
Total 0.0 0.0 0.0 0% 0.0 58 1 0 782G 37.6G 124.6G 15 4.0
Reserved 0.0 0.0 0.0 0% 0.0 0 0 0 0M 0M 0M - 0.0

ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
Total 0.0 0.0 0.0 4.0
Reserved 0.0 0.0 - 0.0

LOAD THRESHOLD USED FOR SCHEDULING:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -

ngpus ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
loadSched - - - - -
loadStop - - - - -

CONFIGURED AFFINITY CPU LIST: all

test@ubuntu02:/opt/DL/caffe-nv$ bhist -l 1033

Job <1033>, User <test>, Project <default>, Command <sudo LD_LIBRARY_PATH=/opt/
DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time
./examples/cifar10/train_quick.sh>
Fri Jul 7 10:27:55: Submitted from host <ubuntu02>, to Queue <normal>, CWD </o
pt/DL/caffe-nv>;
Fri Jul 7 10:27:55: Dispatched 1 Task(s) on Host(s) <ubuntu02>, Allocated 1 Sl
ot(s) on Host(s) <ubuntu02>, Effective RES_REQ <select[typ
e == local] order[r15s:pg] >;
Fri Jul 7 10:27:55: Starting (Pid 14241);
Fri Jul 7 10:27:55: Running with execution home </home/test>, Execution CWD </
opt/DL/caffe-nv>, Execution Pid <14241>;

Summary of time in seconds spent in various states by Fri Jul 7 10:28:23
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
0 0 28 0 0 0 28

If you submit one more Caffe job in this situation, it fails with exit code 134, as shown below. Following its default behavior, Caffe places this job on the first GPU as well, so the job errors out and dies. This happens because LSF does not recognize a job submitted this way as one that needs GPU resources.

test@ubuntu02:/opt/DL/caffe-nv$ bsub sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_quick.sh
Job <1034> is submitted to default queue <normal>.
test@ubuntu02:/opt/DL/caffe-nv$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1033 test RUN normal ubuntu02 ubuntu02 *_quick.sh Jul 7 10:27
test@ubuntu02:/opt/DL/caffe-nv$ bhist -l 1034

Job <1034>, User <test>, Project <default>, Command <sudo LD_LIBRARY_PATH=/opt/
DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time
./examples/cifar10/train_quick.sh>
Fri Jul 7 10:28:40: Submitted from host <ubuntu02>, to Queue <normal>, CWD </o
pt/DL/caffe-nv>;
Fri Jul 7 10:28:41: Dispatched 1 Task(s) on Host(s) <ubuntu02>, Allocated 1 Sl
ot(s) on Host(s) <ubuntu02>, Effective RES_REQ <select[typ
e == local] order[r15s:pg] >;
Fri Jul 7 10:28:41: Starting (Pid 14363);
Fri Jul 7 10:28:41: Running with execution home </home/test>, Execution CWD </
opt/DL/caffe-nv>, Execution Pid <14363>;
Fri Jul 7 10:28:42: Exited with exit code 134. The CPU time used is 0.2 second
s;
Fri Jul 7 10:28:42: Completed <exit>;

Summary of time in seconds spent in various states by Fri Jul 7 10:28:42
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
1 0 1 0 0 0 2
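The exit code above can be decoded by hand. Here is a minimal sketch, assuming the usual convention that an exit status above 128 means "128 + the number of the signal that killed the process"; the function name is made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: map an exit status to the signal number encoded in it,
# following the common convention "status = 128 + signal number".
# Prints 0 when the status does not encode a signal.
signal_from_exit_code() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo $((status - 128))
    else
        echo 0
    fi
}

# Job 1034 exited with code 134, i.e. signal 6 (SIGABRT on Linux).
signal_from_exit_code 134
```

Under that convention, 134 means the caffe process aborted itself, which fits a failed attempt to grab a GPU that is already in use.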

To fix this, you have to tell LSF at submit time that the job needs GPU resources and must be placed accordingly. That is what the select and rusage sections of the bsub -R resource requirement string are for.

In the example below, select[ngpus>0] means "assign this job to a server that has at least one GPU," and rusage[ngpus_excl_p=1] means the job uses one GPU that is in EXCLUSIVE_PROCESS mode.

test@ubuntu02:/opt/DL/caffe-nv$ bsub -R "select[ngpus>0] rusage[ngpus_excl_p=1]" sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_quick.sh
Job <1153> is submitted to default queue <normal>.
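Typing the same -R string over and over gets tedious, so it can be generated. This is only a sketch: gpu_res_req is a hypothetical helper name, and the string it prints is exactly the one used in this post.

```shell
#!/bin/sh
# Hypothetical helper: build the -R string used in this post for a job
# that needs N GPUs in EXCLUSIVE_PROCESS mode (N defaults to 1).
gpu_res_req() {
    ngpus=${1:-1}
    printf 'select[ngpus>0] rusage[ngpus_excl_p=%s]\n' "$ngpus"
}

# Print the strings for a 1-GPU and a 2-GPU job.
gpu_res_req 1
gpu_res_req 2
```

It would be used as bsub -R "$(gpu_res_req 2)" followed by the training command.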

With these options, bhosts -l shows the Total value in the ngpus_excl_p column dropping from 4 to 3, while the Reserved value below it changes to 1.

test@ubuntu02:/opt/DL/caffe-nv$ bhosts -l
HOST ubuntu02
STATUS CPUF JL/U MAX NJOBS RUN SSUSP USUSP RSV DISPATCH_WINDOW
ok 250.00 - 16 1 1 0 0 0 -

CURRENT LOAD USED FOR SCHEDULING:
r15s r1m r15m ut pg io ls it tmp swp mem slots ngpus
Total 0.0 0.0 0.0 0% 0.0 123 1 0 781G 37.6G 124.1G 15 4.0
Reserved 0.0 0.0 0.0 0% 0.0 0 0 0 0M 0M 0M - 0.0

ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
Total 0.0 0.0 0.0 3.0
Reserved 0.0 0.0 - 1.0
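If you only care about those two numbers, they can be pulled out of the bhosts -l text. The sketch below assumes the layout shown in this post (a header row starting with ngpus_shared and ending with ngpus_excl_p, followed by Total and Reserved rows); excl_p_summary is a made-up name, and the sample input is copied from the output above.

```shell
#!/bin/sh
# Hypothetical helper: read `bhosts -l` text on stdin and print the Total
# (currently available) and Reserved values of the ngpus_excl_p column.
excl_p_summary() {
    awk '
        # Header row of the GPU block in the CURRENT LOAD section.
        $1 == "ngpus_shared" && $NF == "ngpus_excl_p" { in_gpu = 1; next }
        in_gpu && $1 == "Total"    { total = $NF }
        in_gpu && $1 == "Reserved" { reserved = $NF; in_gpu = 0 }
        END { printf "available=%d reserved=%d\n", total, reserved }
    '
}

# Sample lines copied from the bhosts -l output in this post.
excl_p_summary <<'EOF'
r15s r1m r15m ut pg io ls it tmp swp mem slots ngpus
Total 0.0 0.0 0.0 0% 0.0 123 1 0 781G 37.6G 124.1G 15 4.0
Reserved 0.0 0.0 0.0 0% 0.0 0 0 0 0M 0M 0M - 0.0

ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
Total 0.0 0.0 0.0 3.0
Reserved 0.0 0.0 - 1.0
EOF
```

On a live system the same function would be fed with bhosts -l ubuntu02 | excl_p_summary.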

What happens if we throw a second job at it in this state?

test@ubuntu02:/opt/DL/caffe-nv$ bsub -R "select[ngpus>0] rusage[ngpus_excl_p=1]" sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_quick.sh
Job <1154> is submitted to default queue <normal>.

test@ubuntu02:/opt/DL/caffe-nv$ bhosts -l
HOST ubuntu02
STATUS CPUF JL/U MAX NJOBS RUN SSUSP USUSP RSV DISPATCH_WINDOW
ok 250.00 - 16 1 1 0 0 0 -

CURRENT LOAD USED FOR SCHEDULING:
r15s r1m r15m ut pg io ls it tmp swp mem slots ngpus
Total 0.0 0.0 0.0 1% 0.0 28 1 0 781G 37.6G 123.7G 15 4.0
Reserved 0.0 0.0 0.0 0% 0.0 0 0 0 0M 0M 0M - 0.0

ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
Total 0.0 0.0 0.0 2.0
Reserved 0.0 0.0 - 2.0

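As you can see, the ngpus_excl_p count has dropped to 2 and Reserved has gone up to 2. In other words, LSF no longer launches the caffe job with its default GPU placement; it takes the GPUs that are already occupied out of the job's environment before launching it!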
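One way to check what a job actually sees is to log the GPU list from inside the job script. The sketch below rests on an assumption this post does not show directly: that LSF hands the assigned GPUs to the job through the CUDA_VISIBLE_DEVICES environment variable. The helper name count_gpu_ids is made up.

```shell
#!/bin/sh
# Hypothetical helper: count the entries in a comma-separated GPU ID list
# such as "2,3". An empty list counts as 0.
count_gpu_ids() {
    echo "$1" | awk -F, '{ print NF }'
}

# ASSUMPTION: LSF exposes the assigned GPUs via CUDA_VISIBLE_DEVICES.
# Outside an LSF GPU job the variable is normally unset.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES-<unset>}"
echo "GPU IDs listed: $(count_gpu_ids "${CUDA_VISIBLE_DEVICES-}")"
```

If the variable turns out to be unset inside your job, note that sudo usually resets the environment, so the value may need to be passed through explicitly.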

Let's look at the first job, ID 1153, with bjobs -l. Near the bottom, under EXTERNAL MESSAGES, ubuntu02:gpus=2 shows that GPU 2 was assigned to it.

test@ubuntu02:/opt/DL/caffe-nv$ bjobs -l 1153

Job <1153>, User <test>, Project <default>, Status <RUN>, Queue <normal>, Comma
nd <sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl
/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_qu
ick.sh>, Share group charged </test>
Fri Jul 7 17:34:02: Submitted from host <ubuntu02>, CWD </opt/DL/caffe-nv>, Re
quested Resources <select[ngpus>0] rusage[ngpus_excl_p=1]>
;
Fri Jul 7 17:34:02: Started 1 Task(s) on Host(s) <ubuntu02>, Allocated 1 Slot(
s) on Host(s) <ubuntu02>, Execution Home </home/test>, Exe
cution CWD </opt/DL/caffe-nv>;
Fri Jul 7 17:35:01: Resource usage collected.
The CPU time used is 117 seconds.
MEM: 277 Mbytes; SWAP: 0 Mbytes; NTHREAD: 78
PGID: 12981; PIDs: 12981 12985 12987 12988 12989 12990

MEMORY USAGE:
MAX MEM: 277 Mbytes; AVG MEM: 274 Mbytes

SCHEDULING PARAMETERS:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -

ngpus ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
loadSched - - - - -
loadStop - - - - -

EXTERNAL MESSAGES:
MSG_ID FROM POST_TIME MESSAGE ATTACHMENT
0 test Jul 7 17:34 ubuntu02:gpus=2; N

RESOURCE REQUIREMENT DETAILS:
Combined: select[(ngpus>0) && (type == local)] order[r15s:pg] rusage[ngpus_exc
l_p=1.00]
Effective: select[(ngpus>0) && (type == local)] order[r15s:pg] rusage[ngpus_ex
cl_p=1.00]

Now the second job, ID 1154, again with bjobs -l. Near the bottom, ubuntu02:gpus=3 shows that GPU 3 was assigned to it.

test@ubuntu02:/opt/DL/caffe-nv$ bjobs -l 1154

Job <1154>, User <test>, Project <default>, Status <EXIT>, Queue <normal>, Comm
and <sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/ncc
l/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_q
uick.sh>, Share group charged </test>
Fri Jul 7 17:34:04: Submitted from host <ubuntu02>, CWD </opt/DL/caffe-nv>, Re
quested Resources <select[ngpus>0] rusage[ngpus_excl_p=1]>
;
Fri Jul 7 17:34:04: Started 1 Task(s) on Host(s) <ubuntu02>, Allocated 1 Slot(
s) on Host(s) <ubuntu02>, Execution Home </home/test>, Exe
cution CWD </opt/DL/caffe-nv>;
...

EXTERNAL MESSAGES:
MSG_ID FROM POST_TIME MESSAGE ATTACHMENT
0 test Jul 7 17:34 ubuntu02:gpus=3; N

RESOURCE REQUIREMENT DETAILS:
Combined: select[(ngpus>0) && (type == local)] order[r15s:pg] rusage[ngpus_exc
l_p=1.00]
Effective: select[(ngpus>0) && (type == local)] order[r15s:pg] rusage[ngpus_ex
cl_p=1.00]

This time, let's add the -gpu option to the caffe command so that each job uses 2 GPUs. Writing -gpu 0,1 as below tells caffe to use gpu0 and gpu1.

test@ubuntu02:/opt/DL/caffe-nv$ cat ./examples/cifar10/train_01.sh
#!/usr/bin/env sh

# Check if CAFFE_BIN is unset
if [ -z "$CAFFE_BIN" ]; then
    # TOOLS=./build/tools
    TOOLS=./bin
else
    TOOLS=$CAFFE_BIN
fi

$TOOLS/caffe train -gpu 0,1 \
    --solver=examples/cifar10/cifar10_quick_solver.prototxt

# reduce learning rate by factor of 10 after 8 epochs
$TOOLS/caffe train -gpu 0,1 \
    --solver=examples/cifar10/cifar10_quick_solver_lr1.prototxt \
    --snapshot=examples/cifar10/cifar10_quick_iter_4000.solverstate.h5

Let's run this script twice in a row. Since it names GPUs 0 and 1 explicitly, will the two jobs fight over the same two GPUs?

test@ubuntu02:/opt/DL/caffe-nv$ bsub -R "select[ngpus>0] rusage[ngpus_excl_p=2]" sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_01.sh
Job <1155> is submitted to default queue <normal>.

test@ubuntu02:/opt/DL/caffe-nv$ bsub -R "select[ngpus>0] rusage[ngpus_excl_p=2]" sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_01.sh
Job <1156> is submitted to default queue <normal>.

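As shown below, bhosts now reports ngpus_excl_p at 0 and Reserved at 4. LSF saw that two of the GPUs were already taken by one job and gave the other job the remaining two, so the jobs do not collide even though both scripts say -gpu 0,1. In this run the first job (1155) got gpu2 and gpu3, and the second job (1156) got gpu0 and gpu1.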

test@ubuntu02:/opt/DL/caffe-nv$ bhosts -l
HOST ubuntu02
STATUS CPUF JL/U MAX NJOBS RUN SSUSP USUSP RSV DISPATCH_WINDOW
ok 250.00 - 16 1 1 0 0 0 -

CURRENT LOAD USED FOR SCHEDULING:
r15s r1m r15m ut pg io ls it tmp swp mem slots ngpus
Total 0.0 0.0 0.0 0% 0.0 11 1 0 781G 37.6G 124.1G 15 4.0
Reserved 0.0 0.0 0.0 0% 0.0 0 0 0 0M 0M 0M - 0.0

ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
Total 0.0 0.0 0.0 0.0
Reserved 0.0 0.0 - 4.0

bjobs -l makes this even clearer.

test@ubuntu02:/opt/DL/caffe-nv$ bjobs -l 1155 | grep gpu
quested Resources <select[ngpus>0] rusage[ngpus_excl_p=2]>
ngpus ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
0 test Jul 7 17:39 ubuntu02:gpus=2,3; N
Combined: select[(ngpus>0) && (type == local)] order[r15s:pg] rusage[ngpus_exc
Effective: select[(ngpus>0) && (type == local)] order[r15s:pg] rusage[ngpus_ex

test@ubuntu02:/opt/DL/caffe-nv$ bjobs -l 1156 | grep gpu
quested Resources <select[ngpus>0] rusage[ngpus_excl_p=2]>
ngpus ngpus_shared ngpus_excl_t ngpuprohibited ngpus_excl_p
0 test Jul 7 17:39 ubuntu02:gpus=0,1; N
Combined: select[(ngpus>0) && (type == local)] order[r15s:pg] rusage[ngpus_exc
Effective: select[(ngpus>0) && (type == local)] order[r15s:pg] rusage[ngpus_ex
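Rather than eyeballing the grep output, the GPU IDs can be cut out of the external message line. A small sketch, assuming the message always has the host:gpus=ID,ID; shape seen in this post; assigned_gpus is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: read `bjobs -l` text on stdin and print the GPU IDs
# from an external message of the form "<host>:gpus=<id>[,<id>...];".
assigned_gpus() {
    sed -n 's/.*:gpus=\([0-9,]*\);.*/\1/p'
}

# Sample lines copied from the bjobs -l output in this post.
printf '%s\n' '0 test Jul 7 17:39 ubuntu02:gpus=2,3; N' | assigned_gpus
printf '%s\n' '0 test Jul 7 17:39 ubuntu02:gpus=0,1; N' | assigned_gpus
```

On a live system this would be run as bjobs -l 1155 | assigned_gpus.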

Now, what if we submit this 2-GPU job three times in a row? Will we get an error?

test@ubuntu02:/opt/DL/caffe-nv$ bsub -R "select[ngpus>0] rusage[ngpus_excl_p=2]" sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_01.sh
Job <1269> is submitted to default queue <normal>.

test@ubuntu02:/opt/DL/caffe-nv$ bsub -R "select[ngpus>0] rusage[ngpus_excl_p=2]" sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_01.sh
Job <1270> is submitted to default queue <normal>.

test@ubuntu02:/opt/DL/caffe-nv$ bsub -R "select[ngpus>0] rusage[ngpus_excl_p=2]" sudo LD_LIBRARY_PATH=/opt/DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time ./examples/cifar10/train_01.sh
Job <1271> is submitted to default queue <normal>.

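No. The first and second jobs take 2 GPUs each and use up all four, so there is no GPU available for the third job right away. In that case the third job simply waits in the PEND state until a running job finishes and its GPUs are released.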

test@ubuntu02:/opt/DL/caffe-nv$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1269 test RUN normal ubuntu02 ubuntu02 *ain_01.sh Jul 7 18:02
1270 test RUN normal ubuntu02 ubuntu02 *ain_01.sh Jul 7 18:02
1271 test PEND normal ubuntu02 *ain_01.sh Jul 7 18:02
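For a quick tally of how many jobs are running and how many are waiting, the STAT column of bjobs can be counted. This sketch assumes the default bjobs column layout shown above, with STAT as the third field; count_jobs_in_state is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: read default `bjobs` output on stdin and print how
# many jobs are in the given state (third column). The header row is skipped.
count_jobs_in_state() {
    awk -v want="$1" 'NR > 1 && $3 == want { n++ } END { print n + 0 }'
}

# Sample copied from the bjobs output in this post.
sample='JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1269 test RUN normal ubuntu02 ubuntu02 *ain_01.sh Jul 7 18:02
1270 test RUN normal ubuntu02 ubuntu02 *ain_01.sh Jul 7 18:02
1271 test PEND normal ubuntu02 *ain_01.sh Jul 7 18:02'

echo "RUN:  $(printf '%s\n' "$sample" | count_jobs_in_state RUN)"
echo "PEND: $(printf '%s\n' "$sample" | count_jobs_in_state PEND)"
```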

After all the jobs have completed, let's look at their history with bhist -l. Job 1270, the second one submitted, spent only 16 seconds in the PEND state before it was dispatched and started using its GPUs.

test@ubuntu02:/opt/DL/caffe-nv$ bhist -l 1270

Job <1270>, User <test>, Project <default>, Command <sudo LD_LIBRARY_PATH=/opt/
DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time
./examples/cifar10/train_01.sh>
Fri Jul 7 18:02:11: Submitted from host <ubuntu02>, to Queue <normal>, CWD </o
pt/DL/caffe-nv>, Requested Resources <select[ngpus>0] rusa
ge[ngpus_excl_p=2]>;
Fri Jul 7 18:02:27: Dispatched 1 Task(s) on Host(s) <ubuntu02>, Allocated 1 Sl
ot(s) on Host(s) <ubuntu02>, Effective RES_REQ <select[(ng
pus>0) && (type == local)] order[r15s:pg] rusage[ngpus_exc
l_p=2.00] >;
Fri Jul 7 18:02:27: Starting (Pid 5830);
Fri Jul 7 18:02:27: Running with execution home </home/test>, Execution CWD </
opt/DL/caffe-nv>, Execution Pid <5830>;
Fri Jul 7 18:02:28: External Message "ubuntu02:gpus=0,1;" was posted from "tes
t" to message box 0;
Fri Jul 7 18:03:44: Done successfully. The CPU time used is 222.3 seconds;
Fri Jul 7 18:03:45: Post job process done successfully;

MEMORY USAGE:
MAX MEM: 517 Mbytes; AVG MEM: 467 Mbytes

Summary of time in seconds spent in various states by Fri Jul 7 18:03:45
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
16 0 77 0 0 0 93

Job 1271, the third one, on the other hand stayed in the PEND state for about 86 seconds, until the first job (1269) finished, and only then was it dispatched.

test@ubuntu02:/opt/DL/caffe-nv$ bhist -l 1271

Job <1271>, User <test>, Project <default>, Command <sudo LD_LIBRARY_PATH=/opt/
DL/openblas/lib:/opt/DL/nccl/lib:/opt/DL/caffe-nv/lib time
./examples/cifar10/train_01.sh>
Fri Jul 7 18:02:15: Submitted from host <ubuntu02>, to Queue <normal>, CWD </o
pt/DL/caffe-nv>, Requested Resources <select[ngpus>0] rusa
ge[ngpus_excl_p=2]>;
Fri Jul 7 18:03:41: Dispatched 1 Task(s) on Host(s) <ubuntu02>, Allocated 1 Sl
ot(s) on Host(s) <ubuntu02>, Effective RES_REQ <select[(ng
pus>0) && (type == local)] order[r15s:pg] rusage[ngpus_exc
l_p=2.00] >;
Fri Jul 7 18:03:42: Starting (Pid 6729);
Fri Jul 7 18:03:42: Running with execution home </home/test>, Execution CWD </
opt/DL/caffe-nv>, Execution Pid <6729>;
Fri Jul 7 18:04:52: Done successfully. The CPU time used is 207.9 seconds;
Fri Jul 7 18:04:53: Post job process done successfully;

MEMORY USAGE:
MAX MEM: 530 Mbytes; AVG MEM: 463 Mbytes

Summary of time in seconds spent in various states by Fri Jul 7 18:04:53
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
86 0 71 0 0 0 157
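The PEND figures in the summaries can be cross-checked against the Submitted and Dispatched timestamps. A minimal sketch, assuming both timestamps fall on the same day; the function names are made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helpers: convert HH:MM:SS to seconds since midnight and
# compute the time a job spent pending (submit -> dispatch, same day).
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

pend_seconds() {
    echo $(( $(hms_to_seconds "$2") - $(hms_to_seconds "$1") ))
}

# Timestamps copied from the bhist -l output in this post.
echo "job 1270 pending: $(pend_seconds 18:02:11 18:02:27) s"
echo "job 1271 pending: $(pend_seconds 18:02:15 18:03:41) s"
```

The results, 16 and 86 seconds, match the PEND column that bhist reports for the two jobs.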

With that, a basic LSF environment for GPU-based deep learning is ready.


Friday, July 7, 2017
