Sources
1. GridEngine
- http://serverfault.com/questions/322073/howto-set-up-sge-for-cuda-devices
- http://gridengine.org/pipermail/users/2012-April/003338.html
2. NVIDIA CUDA
- https://devtalk.nvidia.com/default/topic/697308/compute-processes-not-supported/
Scheduling GPU resources in the Grid
Tweaks and applied configuration
CREAM CE
1. Added to BLAHP script /usr/libexec/sge_local_submit_attributes.sh:
(..)
if [ -n "$gpu" ]; then
    echo "#$ -l gpu=${gpu}"
fi
(..)
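The fragment can be exercised outside of BLAHP with a small sketch. The function name `emit_gpu_directive` is hypothetical and only wraps the same test; it assumes, as the fragment does, that the requested GPU count arrives in a shell variable named `gpu`. Quoting `"$gpu"` matters: an unquoted `[ -n $gpu ]` evaluates to true when the variable is empty, so a `-l gpu=` line would be emitted for every job.

```shell
#!/bin/sh
# Hypothetical wrapper around the BLAHP fragment, for testing outside the CE.
emit_gpu_directive() {
    gpu=$1
    # Only emit the SGE directive when a GPU count was actually requested.
    if [ -n "$gpu" ]; then
        echo "#\$ -l gpu=${gpu}"
    fi
}

emit_gpu_directive 2     # prints: #$ -l gpu=2
emit_gpu_directive ""    # prints nothing
```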
Scheduler
- [qmaster] Define complex value 'gpu':
#name   shortcut   type   relop   requestable   consumable   default   urgency
#------------------------------------------------------------------------------
(..)
gpu     gpu        INT    <=      YES           YES          0         0
(..)
- [qmaster] Host(s) complexes:
hostname              tesla.ifca.es
load_scaling          NONE
complex_values        gpu=4,mem_free=24G,virtual_free=24G
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
- Load sensor:
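#!/bin/sh
# SGE load sensor: reports the number of free GPUs on this host.
hostname=`uname -n`

while [ 1 ]; do
    # execd writes one line on stdin per load report interval
    read input
    result=$?
    if [ $result != 0 ]; then
        exit 1
    fi
    if [ "$input" = "quit" ]; then
        exit 0
    fi

    smitool=`which nvidia-smi`
    result=$?
    if [ $result != 0 ]; then
        # nvidia-smi is not installed: report no GPUs
        gpusavail=0
    else
        gpustotal=`nvidia-smi -L | wc -l`
        # count the rows of the process table printed by nvidia-smi
        gpusused=`nvidia-smi | grep "Process name" -A 6 | grep -v '+-' | grep -v '|=' | grep -v Usage | grep -v "No running" | wc -l`
        gpusavail=`echo $gpustotal-$gpusused | bc`
    fi

    echo begin
    echo "$hostname:gpu:$gpusavail"
    echo end
done

exit 0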
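The two pieces of the load sensor that are easy to get wrong, the arithmetic and the begin/end framing of each report, can be tried without a GPU or a running execd. This is a minimal sketch; `free_gpus` and `format_report` are hypothetical helper names, not part of the sensor above. The clamp at zero is an added assumption: the sensor counts process rows, so several processes sharing one GPU could push the used count above the total.

```shell
#!/bin/sh
# Hypothetical helper: free = total - used, never below zero (assumed clamp).
free_gpus() {
    free=$(( $1 - $2 ))
    if [ "$free" -lt 0 ]; then
        free=0
    fi
    echo "$free"
}

# Hypothetical helper: print one load report in the begin/end framing
# used by the sensor. $1 = host name, $2 = number of free GPUs.
format_report() {
    echo begin
    echo "$1:gpu:$2"
    echo end
}

# Example: 4 GPUs in total, 1 in use.
format_report tesla.ifca.es "$(free_gpus 4 1)"
```

The middle line of each report has the form `host:resource:value`, which is how the `gpu` value reaches the scheduler as a load value for that host.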
- [qmaster] Per-host load sensor:
# qconf -sconf tesla
#tesla.ifca.es:
load_sensor                  /nfs4/opt/gridengine/util/resources/loadsensors/gpu.sh
- The script must be available on the execution node (e.g. shared via NFS)
- [execd] Restart execd process to load the new sensor:
# ps auxf
(..)
root 24786  0.0 0.0 163252 2268 ? Sl 16:51 0:00 /nfs4/opt/gridengine/bin/lx-amd64/sge_execd
root 24798  0.0 0.0 106104 1260 ? S  16:51 0:00  \_ /bin/sh /nfs4/opt/gridengine/util/resources/loadsensors/gpu.sh
root 24801  0.0 0.0 106104  544 ? S  16:51 0:00      \_ /bin/sh /nfs4/opt/gridengine/util/resources/loadsensors/gpu.sh
root 24802 71.0 0.0  11140  988 ? R  16:51 0:00          \_ nvidia-smi -L
root 24803  0.0 0.0 100924  632 ? S  16:51 0:00          \_ wc -l
(..)
- Soft-stop the service if there are jobs running.
- [qmaster] Query the gpu resource of the GPU host:
# qhost -h tesla -F gpu
HOSTNAME   ARCH       NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global     -             -    -    -    -     -       -       -       -       -
tesla      lx-amd64      4    1    4    4  0.19   23.5G    1.7G   11.8G     0.0
    Host Resource(s):      hl:gpu=4.000000
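For scripted checks, the reported value can be pulled out of the `qhost` output. The following is a minimal sketch; `gpu_from_qhost` is a hypothetical helper that reads the output on stdin and assumes the `hl:gpu=<value>` format shown above.

```shell
#!/bin/sh
# Hypothetical helper: read `qhost -h <host> -F gpu` output on stdin and
# print the gpu value as an integer.
gpu_from_qhost() {
    # Split on "=" and take what follows ":gpu=", e.g. "4.000000".
    awk -F= '/:gpu=/ { printf "%d\n", $2 }'
}

# Sample line copied from the output above; on a live system the input
# would come from: qhost -h tesla -F gpu | gpu_from_qhost
printf '%s\n' '    Host Resource(s):      hl:gpu=4.000000' | gpu_from_qhost
```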
(from UI) Testing
# cat test_cream.jdl
[
JobType = "Normal";
Executable = "foo.sh";
StdOutput = "out.out";
StdError = "err.err";
InputSandbox = {"foo.sh"};
OutputSandbox = {"out.out", "err.err"};
OutputSandboxBaseDestUri = "gsiftp://localhost";
CERequirements = "gpu==2";
]
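The JDL ships a script named foo.sh whose contents are not shown here. A minimal sketch of such a payload follows, purely as an assumption about what a useful test job might do: it records the host the job landed on and, if `nvidia-smi` is present, the GPUs visible there. The function name `report_gpus` is hypothetical.

```shell
#!/bin/sh
# Hypothetical foo.sh payload: report where the job ran and which GPUs it sees.
report_gpus() {
    echo "host: $(uname -n)"
    if command -v nvidia-smi >/dev/null 2>&1; then
        # List the GPUs visible on the execution node.
        nvidia-smi -L
    else
        echo "nvidia-smi not found"
    fi
}

report_gpus
```

With the JDL above, this output would be returned in out.out, so a job that landed on the GPU host should list its devices there.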