= Scheduling GPU resources in the Grid =

{i} Full documentation at [[Documentation/Tools/Grid/GPU/GPUsOnGrid|this page]].

== Underlying tale of installations, applied configuration and tweaks ==

=== NVIDIA worker ===
 1. Blacklist `nouveau`. To avoid compilation errors (aka '''ERROR: Unable to load the kernel module 'nvidia.ko'...''') when installing the NVIDIA driver, it is often not enough to include `blacklist nouveau` in `/etc/modprobe.d/blacklist.conf`. It is also required to remove it from the initrd image like so: {{{
# echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/disable-nouveau.conf
# mkinitrd -f -v /boot/initrd-$(uname -r).img $(uname -r)    # or `dracut -f`
}}}
 and reboot.
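  * After the reboot, a quick check that the module is really out of the way (plain `lsmod`, nothing specific to this setup): {{{
# lsmod | grep nouveau    # should print nothing
}}}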
 1. Install `./NVIDIA-Linux-x86_64-325.08.run` or whatever other version.
 1. Install `./cudatoolkit_4.0.17_linux_64_rhel6.0.run` or whatever other version.
 1. Check the `nvidia-smi` command output. If ''Not Supported'' or N/A information is found: {{{
(..)
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
|    1            Not Supported                                               |
|    2            Not Supported                                               |
|    3            Not Supported                                               |
+-----------------------------------------------------------------------------+
}}}
 we need to patch the `libnvidia-ml.so.1` library:
  1. Get the patch from Github's [[https://github.com/CFSworks/nvml_fix|nvml_fix]] repository.
  1. Compile it with `TARGET=<your-nvidia-driver-version>` (must be supported by the fix).
   * HACK: in Scientific Linux 6 it must be compiled with the `pthread` and `dl` libraries: {{{
# cat Makefile
(..)
CFLAGS = -lpthread -ldl
(..)
}}}
  1. Remove the link `/usr/lib64/libnvidia-ml.so.1` and substitute it with the just created `$PWD/libnvidia-ml.so.1` file.
   * Note that we use '''lib64''' (not the default Makefile's libdir location -> lib).
   * Do not use `make install PREFIX=/usr`, copy it by hand.
   * Do not create a link, since `ldconfig` will overwrite it.
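   * Put together, and assuming the driver version installed above (325.08) and the default build output in `$PWD`, the whole swap boils down to something like: {{{
# make TARGET=325.08
# rm /usr/lib64/libnvidia-ml.so.1
# cp libnvidia-ml.so.1 /usr/lib64/libnvidia-ml.so.1
}}}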
 Now `nvidia-smi` output should look like: {{{
(..)
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+
}}}
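 As an extra sanity check (the load sensor below relies on it), `nvidia-smi -L` should list one line per device; the model names here are purely illustrative: {{{
# nvidia-smi -L
GPU 0: Tesla <model> (UUID: ...)
GPU 1: Tesla <model> (UUID: ...)
GPU 2: Tesla <model> (UUID: ...)
GPU 3: Tesla <model> (UUID: ...)
}}}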
=== CREAM CE ===
 1. Added to the BLAHP script `/usr/libexec/sge_local_submit_attributes.sh`: {{{
(..)
if [ -n "$gpu" ]; then
  echo "#$ -l gpu=${gpu}"
fi
(..)
}}}
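 From the user side, a minimal CREAM job description exercising this attribute could look like the sketch below (executable and sandbox names are placeholders): {{{
[
  JobType = "Normal";
  Executable = "foo.sh";
  StdOutput = "out.out";
  StdError = "err.err";
  InputSandbox = {"foo.sh"};
  OutputSandbox = {"out.out", "err.err"};
  OutputSandboxBaseDestUri = "gsiftp://localhost";
  CERequirements = "gpu==2";
]
}}}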
=== Scheduler ===
 1. [qmaster] Define complex value 'gpu': {{{
#name               shortcut   type      relop requestable consumable default  urgency
#--------------------------------------------------------------------------------------
(..)
gpu                 gpu        INT       <=    YES         YES        0        0
(..)
}}}
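  * The entry above goes into the complex configuration, edited with `qconf -mc` on the qmaster. Once it exists, a batch job can request GPUs directly (`job.sh` below is just a placeholder), which is exactly what the BLAHP attribute above generates: {{{
# qconf -mc              # add the 'gpu' line to the complex configuration
$ qsub -l gpu=1 job.sh   # example request for one GPU
}}}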
 1. [qmaster] Host(s) complexes: {{{
hostname              tesla.ifca.es
load_scaling          NONE
complex_values        gpu=4,mem_free=24G,virtual_free=24G
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
}}}
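  * This is the execution host object as shown by `qconf -se tesla`; the `complex_values` entry can be edited interactively with `qconf -me tesla` or, for instance, added non-interactively: {{{
# qconf -aattr exechost complex_values gpu=4 tesla.ifca.es   # adds the gpu entry, leaves the rest untouched
}}}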
 1. Load sensor: {{{
#!/bin/sh
hostname=`uname -n`

while [ 1 ]; do
  # wait for a request from sge_execd
  read input
  result=$?
  if [ $result != 0 ]; then
    exit 1
  fi
  if [ "$input" == "quit" ]; then
    exit 0
  fi

  # count total and busy GPUs; report 0 available if nvidia-smi is missing
  smitool=`which nvidia-smi`
  result=$?
  if [ $result != 0 ]; then
    gpusavail=0
  else
    gpustotal=`nvidia-smi -L|wc -l`
    gpusused=`nvidia-smi |grep "Process name" -A 6|grep -v +-|grep -v \|=|grep -v Usage|grep -v "No running"|wc -l`
    gpusavail=`echo $gpustotal-$gpusused|bc`
  fi

  echo begin
  echo "$hostname:gpu:$gpusavail"
  echo end
done

exit 0
}}}
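  * The script follows the usual GridEngine load sensor protocol (read a line on stdin, answer with a begin/end block), so it can be tried by hand before wiring it in; the reported value below is illustrative: {{{
# /nfs4/opt/gridengine/util/resources/loadsensors/gpu.sh
                      <- press Enter to trigger a report
begin
tesla:gpu:4
end
quit                  <- terminates the sensor
}}}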
 1. [qmaster] Per-host load sensor: {{{
# qconf -sconf tesla
#tesla.ifca.es:
load_sensor                  /nfs4/opt/gridengine/util/resources/loadsensors/gpu.sh
}}}
  * Must be available in the execution node (e.g. shared via NFS)
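  * The per-host configuration itself is edited with `qconf -mconf tesla` (or in the global configuration if every worker carries GPUs), adding a line along these lines: {{{
# qconf -mconf tesla
(..)
load_sensor                  /nfs4/opt/gridengine/util/resources/loadsensors/gpu.sh
(..)
}}}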
 1. [execd] Restart execd process to load the new sensor: {{{
# ps auxf
(..)
root     24786  0.0  0.0 163252  2268 ?   Sl   16:51   0:00 /nfs4/opt/gridengine/bin/lx-amd64/sge_execd
root     24798  0.0  0.0 106104  1260 ?   S    16:51   0:00  \_ /bin/sh /nfs4/opt/gridengine/util/resources/loadsensors/gpu.sh
root     24801  0.0  0.0 106104   544 ?   S    16:51   0:00      \_ /bin/sh /nfs4/opt/gridengine/util/resources/loadsensors/gpu.sh
root     24802 71.0  0.0  11140   988 ?   R    16:51   0:00          \_ nvidia-smi -L
root     24803  0.0  0.0 100924   632 ?   S    16:51   0:00          \_ wc -l
(..)
}}}
  * Soft-stop the service if there are jobs running; see the sketch below.
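  * One way to do that (the init script path is an assumption, `$SGE_ROOT/$SGE_CELL/common/sgeexecd` in a standard install; adapt to how execd is started on your nodes): {{{
# qconf -ke tesla      # soft-stop: execd exits, running jobs are left alone
# ssh tesla '/nfs4/opt/gridengine/default/common/sgeexecd start'   # assumed path; restart execd on the node
}}}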
 1. [qmaster] Query the GPU-host `gpu` resource: {{{
# qhost -h tesla -F gpu
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
tesla                   lx-amd64        4    1    4    4  0.19   23.5G    1.7G   11.8G     0.0
   Host Resource(s):      hl:gpu=4.000000
}}}

== References ==
 1. GridEngine
  * http://serverfault.com/questions/322073/howto-set-up-sge-for-cuda-devices
  * http://gridengine.org/pipermail/users/2012-April/003338.html
 1. NVIDIA CUDA
  * https://github.com/CFSworks/nvml_fix