Wednesday, February 28, 2018

Basic setup of the AC922, and installing an Anaconda virtualenv, tensorflow, and more

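First, because the CUDA release for the AC922 is still unstable, a few settings need to be adjusted.  If that has not been done yet, please follow the link below to configure them.  Once the settings are correct, the "nvidia-smi -l 3" command should run without showing any problems.  If the GPU utilization (GPU-Util) column shows something like "Unknown Error", the settings have not been applied.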

https://hwengineer.blogspot.kr/2018/01/ac922-ubuntu-container-image-caffe.html
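The check described above can also be scripted. The following is only a minimal sketch: the helper names has_unknown_error and query_nvidia_smi are made up for this post, and the script assumes that a misconfigured GPU shows the text "Unknown Error" somewhere in the plain nvidia-smi output.

```python
import shutil
import subprocess


def has_unknown_error(smi_text):
    """Return True if any line of nvidia-smi output contains 'Unknown Error'."""
    return any("unknown error" in line.lower() for line in smi_text.splitlines())


def query_nvidia_smi():
    """Run nvidia-smi once and return its text, or None if it is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(["nvidia-smi"], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    return result.stdout


if __name__ == "__main__":
    text = query_nvidia_smi()
    if text is None:
        print("nvidia-smi not found")
    elif has_unknown_error(text):
        print("GPU settings still need to be applied")
    else:
        print("nvidia-smi output looks clean")
```

The text match is deliberately loose (case-insensitive, anywhere on a line), since the exact column layout differs between driver versions.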

On ppc64le, download the EPEL release rpm as follows.

[root@ac922 ~]# wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

Installing it as shown below puts the EPEL repository files, such as /etc/yum.repos.d/epel.repo, in place.

[root@ac922 ~]# rpm -Uvh epel-release-latest-7.noarch.rpm

[root@ac922 ~]# rpm -ql epel-release-7-11.noarch
/etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
/etc/yum.repos.d/epel-testing.repo
/etc/yum.repos.d/epel.repo
/usr/lib/systemd/system-preset/90-epel.preset
/usr/share/doc/epel-release-7
/usr/share/doc/epel-release-7/GPL

The contents of /etc/yum.repos.d/epel.repo look like this.

[root@ac922 ~]# vi /etc/yum.repos.d/epel.repo
[epel]
name=Extra Packages for Enterprise Linux 7 - $basearch
#baseurl=http://download.fedoraproject.org/pub/epel/7/$basearch
metalink=https://mirrors.fedoraproject.org/metalink?repo=epel-7&arch=$basearch
failovermethod=priority
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7

[epel-debuginfo]
name=Extra Packages for Enterprise Linux 7 - $basearch - Debug
#baseurl=http://download.fedoraproject.org/pub/epel/7/$basearch/debug
metalink=https://mirrors.fedoraproject.org/metalink?repo=epel-debug-7&arch=$basearch
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=1

[epel-source]
name=Extra Packages for Enterprise Linux 7 - $basearch - Source
#baseurl=http://download.fedoraproject.org/pub/epel/7/SRPMS
metalink=https://mirrors.fedoraproject.org/metalink?repo=epel-source-7&arch=$basearch
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=1
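
To see at a glance which of these sections are active, the file can be read as an INI file. This is just a sketch using Python's standard configparser module; EPEL_REPO is a shortened copy of the file above, and enabled_repos is a helper name invented for this example. On a real system you would read /etc/yum.repos.d/epel.repo instead.

```python
import configparser

# Shortened copy of the epel.repo file shown above (two of the three sections).
EPEL_REPO = """\
[epel]
name=Extra Packages for Enterprise Linux 7 - $basearch
#baseurl=http://download.fedoraproject.org/pub/epel/7/$basearch
metalink=https://mirrors.fedoraproject.org/metalink?repo=epel-7&arch=$basearch
enabled=1
gpgcheck=1

[epel-debuginfo]
name=Extra Packages for Enterprise Linux 7 - $basearch - Debug
metalink=https://mirrors.fedoraproject.org/metalink?repo=epel-debug-7&arch=$basearch
enabled=0
gpgcheck=1
"""


def enabled_repos(repo_text):
    """Return the section names whose 'enabled' key is 1."""
    # interpolation=None keeps values such as $basearch untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(repo_text)
    # yum treats a section without an 'enabled' key as enabled, hence fallback "1".
    return [name for name in parser.sections()
            if parser.get(name, "enabled", fallback="1").strip() == "1"]


print(enabled_repos(EPEL_REPO))  # ['epel']
```

Only the main [epel] section is enabled, so the debuginfo and source repositories are not consulted by yum unless you switch them on.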

Now run yum update and then yum list; packages such as python34.ppc64le show up in the epel repository.  Install that package.

[root@ac922 ~]# yum update

[root@ac922 ~]# yum list | grep python34 | more
python34.ppc64le                           3.4.5-5.el7             epel
python34-libs.ppc64le                      3.4.5-5.el7             epel
libpeas-loader-python34.ppc64le            1.20.0-1.el7            epel
python34-Cython.ppc64le                    0.23.5-1.el7            epel
...

[root@ac922 ~]# yum install python34
...
Running transaction
  Installing : python34-3.4.5-5.el7.ppc64le                                                             1/2
  Installing : python34-libs-3.4.5-5.el7.ppc64le                                                        2/2
  Verifying  : python34-libs-3.4.5-5.el7.ppc64le                                                        1/2
  Verifying  : python34-3.4.5-5.el7.ppc64le                                                             2/2

Installed:
  python34.ppc64le 0:3.4.5-5.el7

Dependency Installed:
  python34-libs.ppc64le 0:3.4.5-5.el7

Complete!

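That said, people usually use Anaconda rather than the python34 package installed through yum on Red Hat.
I will assume Anaconda is already installed.  If it is not, download one of the installers below and run the downloaded file.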

$ wget https://repo.continuum.io/archive/Anaconda2-5.1.0-Linux-ppc64le.sh
$ wget https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-ppc64le.sh

This post assumes Anaconda2.

[user1@ac922 ~]$ which conda
~/anaconda2/bin/conda

Create the virtualenv (a conda environment) exactly as you would on x86.  Here we create a python 3.6 environment.

[user1@ac922 ~]$ conda create -n env_py36 python=3.6 anaconda
...
isort 4.2.15: ###################################################################################### | 100%
six 1.11.0: ######################################################################################## | 100%
conda-build 3.4.1: ################################################################################# | 100%
numpy 1.13.3: ###################################################################################### | 100%
click 6.7: ######################################################################################### | 100%
send2trash 1.4.2: ################################################################################## | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use:
# > source activate env_py36
#
# To deactivate an active environment, use:
# > source deactivate
#

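Using the virtualenv is also the same as on x86.  The only difference is that commands such as pip are named plain pip rather than pip3.  The pip version is, of course, the one that ships with python 3.6.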

[user1@ac922 ~]$ source activate env_py36

(env_py36) [user1@ac922 ~]$ which python
~/anaconda2/envs/env_py36/bin/python

(env_py36) [user1@ac922 ~]$ which pip
~/anaconda2/envs/env_py36/bin/pip

(env_py36) [user1@ac922 ~]$ pip --version
pip 9.0.1 from /home/user1/anaconda2/envs/env_py36/lib/python3.6/site-packages (python 3.6)
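
The same check can be made from inside Python. A minimal sketch, assuming the usual conda layout of .../envs/<name>/bin/python; env_name_from_executable is a helper name invented for this example.

```python
import sys
from pathlib import PurePosixPath


def env_name_from_executable(executable):
    """Return the conda env name found in an interpreter path, or None."""
    parts = PurePosixPath(executable).parts
    if "envs" in parts:
        idx = parts.index("envs")
        if idx + 1 < len(parts):
            return parts[idx + 1]
    return None


# Path taken from the "which python" output above.
print(env_name_from_executable("/home/user1/anaconda2/envs/env_py36/bin/python"))  # env_py36

# The interpreter actually running this script, and its major/minor version.
print(env_name_from_executable(sys.executable), sys.version_info[:2])
```

If the second line prints None, the script is running under an interpreter outside any conda environment, for example the base Anaconda2 python.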

Now install whatever you need, such as jupyterhub or sudospawner, in the usual way.

(env_py36) [user1@ac922 ~]$ pip install jupyterhub
...
Successfully built alembic python-oauth2 Mako python-editor
Installing collected packages: pamela, Mako, python-editor, alembic, python-oauth2, jupyterhub
Successfully installed Mako-1.0.7 alembic-0.9.8 jupyterhub-0.8.1 pamela-0.3.0 python-editor-1.0.3 python-oauth2-1.1.0

(env_py36) [user1@ac922 ~]$ pip install sudospawner
...
Requirement already satisfied: parso==0.1.* in ./anaconda2/envs/env_py36/lib/python3.6/site-packages (from jedi>=0.10->ipython>=4.0.0->ipykernel->notebook->sudospawner)
Requirement already satisfied: wcwidth in ./anaconda2/envs/env_py36/lib/python3.6/site-packages (from prompt_toolkit<2.0.0,>=1.0.4->ipython>=4.0.0->ipykernel->notebook->sudospawner)
Installing collected packages: sudospawner
Successfully installed sudospawner-0.5.1


However, if you try to install tensorflow the same way, you will get an error like the one below.

(env_py36) [user1@ac922 ~]$ pip install tensorflow
Collecting tensorflow
  Could not find a version that satisfies the requirement tensorflow (from versions: )
No matching distribution found for tensorflow

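This is because there is not yet a prebuilt tensorflow 1.4.1 for CUDA 9 on ppc64le; the empty version list in the error shows that pip found no wheel matching this platform at all.  Instead, I have a wheel file that I built from source.  (Should you need it) the procedure for building tensorflow 1.4.1 from source is described at this link.  The tensorflow-1.4.1-cp36-cp36m-linux_ppc64le.whl file I built can be downloaded here.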

(env_py36) [user1@ac922 ~]$ ls files
...
tensorflow-1.4.1-cp36-cp36m-linux_ppc64le.whl
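
The wheel file name itself says which interpreter and platform it targets, which is what pip compares against when it decides whether a distribution matches. The sketch below splits the name into the fields of the wheel naming convention (PEP 427); parse_wheel_filename and matches_machine are hypothetical helpers written for this post, and the optional build tag is deliberately not handled.

```python
import platform


def parse_wheel_filename(filename):
    """Split a wheel filename into its five mandatory fields (no build tag)."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel: " + filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError("unexpected wheel name: " + filename)
    keys = ("distribution", "version", "python_tag", "abi_tag", "platform_tag")
    return dict(zip(keys, parts))


def matches_machine(info, machine):
    """Rough check: is the platform tag generic or does it end in the machine name?"""
    tag = info["platform_tag"]
    return tag == "any" or tag.endswith("_" + machine)


info = parse_wheel_filename("tensorflow-1.4.1-cp36-cp36m-linux_ppc64le.whl")
print(info["python_tag"], info["platform_tag"])  # cp36 linux_ppc64le
print(matches_machine(info, "ppc64le"))          # True
print(matches_machine(info, "x86_64"))           # False

# On the machine where this runs:
print(platform.machine(), matches_machine(info, platform.machine()))
```

The cp36 tag is also why this wheel belongs in the python 3.6 environment created earlier. pip's real compatibility check is more detailed than this suffix comparison.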

Install this wheel file with pip as follows.

(env_py36) [user1@ac922 ~]$ pip install ./files/tensorflow-1.4.1-cp36-cp36m-linux_ppc64le.whl
Processing ./files/tensorflow-1.4.1-cp36-cp36m-linux_ppc64le.whl
Requirement already satisfied: six>=1.10.0 in ./anaconda2/envs/env_py36/lib/python3.6/site-packages (from tensorflow==1.4.1)
Requirement already satisfied: numpy>=1.12.1 in ./anaconda2/envs/env_py36/lib/python3.6/site-packages (from tensorflow==1.4.1)
Collecting protobuf>=3.3.0 (from tensorflow==1.4.1)
  Downloading protobuf-3.5.1-py2.py3-none-any.whl (388kB)
    100% |████████████████████████████████| 389kB 1.5MB/s
Collecting tensorflow-tensorboard<0.5.0,>=0.4.0rc1 (from tensorflow==1.4.1)
  Downloading tensorflow_tensorboard-0.4.0-py3-none-any.whl (1.7MB)
    100% |████████████████████████████████| 1.7MB 629kB/s
Collecting enum34>=1.1.6 (from tensorflow==1.4.1)
  Downloading enum34-1.1.6-py3-none-any.whl
Requirement already satisfied: wheel>=0.26 in ./anaconda2/envs/env_py36/lib/python3.6/site-packages (from tensorflow==1.4.1)
Requirement already satisfied: setuptools in ./anaconda2/envs/env_py36/lib/python3.6/site-packages (from protobuf>=3.3.0->tensorflow==1.4.1)
Collecting bleach==1.5.0 (from tensorflow-tensorboard<0.5.0,>=0.4.0rc1->tensorflow==1.4.1)
  Downloading bleach-1.5.0-py2.py3-none-any.whl
Collecting html5lib==0.9999999 (from tensorflow-tensorboard<0.5.0,>=0.4.0rc1->tensorflow==1.4.1)
  Downloading html5lib-0.9999999.tar.gz (889kB)
    100% |████████████████████████████████| 890kB 1.2MB/s
Collecting markdown>=2.6.8 (from tensorflow-tensorboard<0.5.0,>=0.4.0rc1->tensorflow==1.4.1)
  Downloading Markdown-2.6.11-py2.py3-none-any.whl (78kB)
    100% |████████████████████████████████| 81kB 1.5MB/s
Requirement already satisfied: werkzeug>=0.11.10 in ./anaconda2/envs/env_py36/lib/python3.6/site-packages (from tensorflow-tensorboard<0.5.0,>=0.4.0rc1->tensorflow==1.4.1)
Building wheels for collected packages: html5lib
  Running setup.py bdist_wheel for html5lib ... done
  Stored in directory: /home/user1/.cache/pip/wheels/6f/85/6c/56b8e1292c6214c4eb73b9dda50f53e8e977bf65989373c962
Successfully built html5lib
Installing collected packages: protobuf, html5lib, bleach, markdown, tensorflow-tensorboard, enum34, tensorflow
  Found existing installation: html5lib 1.0.1
    Uninstalling html5lib-1.0.1:
      Successfully uninstalled html5lib-1.0.1
  Found existing installation: bleach 2.1.2
    Uninstalling bleach-2.1.2:
      Successfully uninstalled bleach-2.1.2
Successfully installed bleach-1.5.0 enum34-1.1.6 html5lib-0.9999999 markdown-2.6.11 protobuf-3.5.1 tensorflow-1.4.1 tensorflow-tensorboard-0.4.0

(env_py36) [user1@ac922 ~]$ pip list | grep tensorflow
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
tensorflow (1.4.1)
tensorflow-tensorboard (0.4.0)

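The package is named tensorflow rather than tensorflow-gpu, but that does not make it a CPU-only build of tensorflow.  A build from source is not split into separate tensorflow-gpu and tensorflow packages, and this one was built with --config=cuda so that it uses the GPUs.  A simple test like the following confirms that tensorflow picks up the V100 GPUs.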

(env_py36) [user1@ac922 ~]$ python
Python 3.6.4 |Anaconda, Inc.| (default, Feb 11 2018, 08:19:13)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> sess=tf.Session()
2018-02-28 15:08:07.063120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
totalMemory: 15.75GiB freeMemory: 15.33GiB
2018-02-28 15:08:07.432906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 1 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2018-02-28 15:08:07.856193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 2 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
totalMemory: 15.75GiB freeMemory: 15.33GiB
2018-02-28 15:08:08.288151: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2018-02-28 15:08:08.292038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2018-02-28 15:08:08.295822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1 2 3
2018-02-28 15:08:08.295838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y Y N N
2018-02-28 15:08:08.295851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   Y Y N N
2018-02-28 15:08:08.295863: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 2:   N N Y Y
2018-02-28 15:08:08.295874: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 3:   N N Y Y
2018-02-28 15:08:08.295894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0)
2018-02-28 15:08:08.295909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0004:05:00.0, compute capability: 7.0)
2018-02-28 15:08:08.295921: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0035:03:00.0, compute capability: 7.0)
2018-02-28 15:08:08.295933: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:3) -> (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0035:04:00.0, compute capability: 7.0)
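
The "Device peer to peer matrix" in this log is easier to read once it is reduced to groups of GPUs. The sketch below parses the matrix rows copied from the log above; it is only an illustration, peer_groups is a made-up helper name, and the regular expression assumes the exact row layout printed by this tensorflow 1.4.1 build.

```python
import re

# The peer-to-peer matrix lines copied from the session log above.
LOG = """\
2018-02-28 15:08:08.295822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1 2 3
2018-02-28 15:08:08.295838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y Y N N
2018-02-28 15:08:08.295851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   Y Y N N
2018-02-28 15:08:08.295863: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 2:   N N Y Y
2018-02-28 15:08:08.295874: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 3:   N N Y Y
"""

# A matrix row looks like "...] 0:   Y Y N N".
ROW = re.compile(r"\]\s+(\d+):\s+((?:[YN]\s*)+)$")


def peer_groups(log_text):
    """Return the distinct sets of GPUs marked Y for each other, in device order."""
    matrix = {}
    for line in log_text.splitlines():
        match = ROW.search(line)
        if match:
            flags = match.group(2).split()
            matrix[int(match.group(1))] = [flag == "Y" for flag in flags]
    groups = []
    for dev in sorted(matrix):
        peers = tuple(i for i, ok in enumerate(matrix[dev]) if ok)
        if peers not in groups:
            groups.append(peers)
    return groups


print(peer_groups(LOG))  # [(0, 1), (2, 3)]
```

In this log GPUs 0 and 1 have peer access to each other, as do GPUs 2 and 3, which lines up with the two PCI domains (0004 and 0035) reported for the devices.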


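At this point nvidia-smi shows tensorflow occupying the GPUs, as below.  By default, tensorflow 1.x reserves nearly all of the memory on every visible GPU as soon as a session is created, which is why each GPU shows about 15 GiB in use while GPU-Util stays at 0%.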

Wed Feb 28 15:08:17 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.36                 Driver Version: 387.36                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   37C    P0    51W / 300W |  15340MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   42C    P0    52W / 300W |  15345MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   39C    P0    51W / 300W |  15343MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   43C    P0    52W / 300W |  15344MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     10141      C   python                                     15336MiB |
|    1     10141      C   python                                     15334MiB |
|    2     10141      C   python                                     15332MiB |
|    3     10141      C   python                                     15338MiB |
+-----------------------------------------------------------------------------+
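
To put a number on how much of each GPU the python process has taken, the Memory-Usage column can be turned into fractions. This is a small sketch over the four memory lines copied from the table above; memory_fractions is a helper name invented here, and the pattern assumes the "usedMiB / totalMiB" layout of this nvidia-smi version.

```python
import re

# The four per-GPU status lines copied from the nvidia-smi table above.
SMI_ROWS = """\
| N/A   37C    P0    51W / 300W |  15340MiB / 16128MiB |      0%      Default |
| N/A   42C    P0    52W / 300W |  15345MiB / 16128MiB |      0%      Default |
| N/A   39C    P0    51W / 300W |  15343MiB / 16128MiB |      0%      Default |
| N/A   43C    P0    52W / 300W |  15344MiB / 16128MiB |      0%      Default |
"""

# Matches "15340MiB / 16128MiB" and captures used and total.
USAGE = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def memory_fractions(smi_text):
    """Return used/total memory as a fraction for each GPU row found."""
    return [int(used) / int(total) for used, total in USAGE.findall(smi_text)]


for index, fraction in enumerate(memory_fractions(SMI_ROWS)):
    print("GPU %d: %.1f%% of memory in use" % (index, 100 * fraction))
```

Each of the four GPUs comes out at roughly 95% of its memory in use, all held by the single python process (PID 10141) that created the session.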
