Scheduling GPU resources in the Grid

Tweaks and applied configuration

CREAM CE

1. Added to BLAHP script /usr/libexec/sge_local_submit_attributes.sh:

(..)
if [ -n "$gpu" ]; then
    echo "#$ -l gpu=${gpu}"
fi
(..)
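The hook only emits the directive when the gpu attribute is set. Note that the test must quote the variable: an unquoted `[ -n $gpu ]` is true even when $gpu is empty. A minimal, self-contained sketch of that logic (the function name is illustrative, not part of BLAHP):

```shell
# Illustrative stand-in for the BLAHP hook: emit an SGE directive only when
# the 'gpu' attribute was passed through (the function name is hypothetical).
emit_gpu_directive() {
  gpu="$1"
  # Quoting matters: an unquoted [ -n $gpu ] is true even for an empty $gpu.
  if [ -n "$gpu" ]; then
    echo "#\$ -l gpu=${gpu}"
  fi
}

emit_gpu_directive 2    # prints: #$ -l gpu=2
emit_gpu_directive ""   # prints nothing
```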

Scheduler

  1. [qmaster] Define complex value 'gpu':
      #name               shortcut     type        relop requestable consumable default  urgency
      #-------------------------------------------------------------------------------------------
      (..)
      gpu                 gpu          INT         <=    YES         YES        0        0
      (..)
  2. [qmaster] Host(s) complexes:
      hostname              tesla.ifca.es
          load_scaling          NONE
          complex_values        gpu=4,mem_free=24G,virtual_free=24G
          user_lists            NONE
          xuser_lists           NONE
          projects              NONE
          xprojects             NONE
          usage_scaling         NONE
          report_variables      NONE
  3. Load sensor:
      hostname=`uname -n`

      while [ 1 ]; do
        read input
        result=$?
        if [ $result != 0 ]; then
          exit 1
        fi
        if [ "$input" == "quit" ]; then
          exit 0
        fi

        smitool=`which nvidia-smi`
        result=$?
        if [ $result != 0 ]; then
          # no nvidia-smi on this host: report zero available GPUs
          gpusavail=0
        else
          gpustotal=`nvidia-smi -L|wc -l`
          gpusused=`nvidia-smi |grep "Process name" -A 6|grep -v +-|grep -v \|=|grep -v Usage|grep -v "No running"|wc -l`
          gpusavail=`echo $gpustotal-$gpusused|bc`
        fi

        echo begin
        echo "$hostname:gpu:$gpusavail"
        echo end
      done

      exit 0
  4. [qmaster] Per-host load sensor:
      # qconf -sconf tesla
      #tesla.ifca.es:
      load_sensor                  /nfs4/opt/gridengine/util/resources/loadsensors/gpu.sh
    * The sensor script must be available on the execution node (e.g. shared via NFS)
  5. [execd] Restart execd process to load the new sensor:
      # ps auxf
      (..)
      root     24786  0.0  0.0 163252  2268 ?        Sl   16:51   0:00 /nfs4/opt/gridengine/bin/lx-amd64/sge_execd
      root     24798  0.0  0.0 106104  1260 ?        S    16:51   0:00  \_ /bin/sh /nfs4/opt/gridengine/util/resources/loadsensors/gpu.sh
      root     24801  0.0  0.0 106104   544 ?        S    16:51   0:00      \_ /bin/sh /nfs4/opt/gridengine/util/resources/loadsensors/gpu.sh
      root     24802 71.0  0.0  11140   988 ?        R    16:51   0:00          \_ nvidia-smi -L
      root     24803  0.0  0.0 100924   632 ?        S    16:51   0:00          \_ wc -l
      (..)
    * Soft-stop the service first if there are jobs running.
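The script in step 3 follows Grid Engine's load-sensor protocol: execd writes a line to the sensor's stdin for each poll, the sensor answers with a begin/value/end block, and the literal string "quit" ends the loop. A self-contained sketch of that cycle with the nvidia-smi query stubbed out (the hard-coded counts are illustrative):

```shell
# Simplified load-sensor loop; gpustotal/gpusused are stubs where the real
# script queries nvidia-smi.
sensor_loop() {
  hostname=$(uname -n)
  while true; do
    read -r input || return 1        # stdin closed: give up
    [ "$input" = "quit" ] && return 0
    gpustotal=4                      # stub: real script runs nvidia-smi -L
    gpusused=1                       # stub: real script parses nvidia-smi
    gpusavail=$((gpustotal - gpusused))
    echo begin
    echo "$hostname:gpu:$gpusavail"
    echo end
  done
}

# One poll cycle (an empty input line), then an orderly shutdown:
printf '\nquit\n' | sensor_loop
```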

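The qmaster/execd steps above can be sketched as the following command sequence. This assumes the example host and paths used throughout; the execd init-script location varies per installation:

```shell
# [qmaster] add the 'gpu' complex line (opens the complex list in $EDITOR):
qconf -mc
# [qmaster] set complex_values gpu=4,... on the execution host:
qconf -me tesla
# [qmaster] point the host configuration at the load sensor script:
qconf -mconf tesla
# [execd, on the node] restart execd so it starts the new sensor; softstop
# first if jobs are running, so they are not killed:
/etc/init.d/sgeexecd softstop
/etc/init.d/sgeexecd start
# Verify: the complex exists and the host reports a gpu load value:
qconf -sc | grep gpu
qhost -F gpu -h tesla
```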
Testing (from the UI)

# cat test_cream.jdl
[ 
  JobType = "Normal"; 
  Executable = "foo.sh"; 
  StdOutput="out.out"; 
  StdError="err.err"; 
  InputSandbox={"foo.sh"}; 
  OutputSandbox={"out.out", "err.err" }; 
  OutputSandboxBaseDestUri="gsiftp://localhost";
  CERequirements="gpu==2";
]
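The JDL above can then be submitted with the standard CREAM CLI; the endpoint and queue below are placeholders, substitute your own CE:

```shell
# -a: automatic proxy delegation; -r: CREAM CE endpoint and queue
# (cream.example.org and the queue name are placeholders).
glite-ce-job-submit -a -r cream.example.org:8443/cream-sge-gpuqueue test_cream.jdl

# Follow the job with the ID returned by the submit command:
glite-ce-job-status <jobid>
```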

eciencia: FitSM/GR2DOC/Tools/Grid/GPU (last edited 2016-07-07 11:11:52 by nunezm)