= Scheduling GPU resources in the Grid =

== Underlying tale of installations, applied configuration and tweaks ==


=== NVIDIA worker ===

==== CREAM CE ====
 1. Added to the BLAHP script /usr/libexec/sge_local_submit_attributes.sh, so that a `gpu` value coming in from the CE requirements is translated into an SGE resource request:
    {{{
(..)
# Quoting $gpu is required: with an unquoted empty value, [ -n ] is always true
if [ -n "$gpu" ]; then
    echo "#$ -l gpu=${gpu}"
fi
(..)
    }}}
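    For illustration, with the snippet above in place a CE requirement of `gpu==2` (as in the test JDL further below) should end up injecting the following directive into the generated SGE job script; this is the expected effect, not captured output:
    {{{
#$ -l gpu=2
    }}}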

==== Scheduler ====
 1. [qmaster] Define complex value 'gpu':
    {{{
    #name               shortcut     type        relop requestable consumable default  urgency
    #-------------------------------------------------------------------------------------------
    (..)
    gpu                 gpu          INT         <=    YES         YES        0        0
    (..)
    }}}
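    The definition above is part of the qmaster's complex configuration; a minimal sketch of how it is typically added and verified with the standard SGE commands (the editor session itself is site-specific):
    {{{
# Open the complex configuration in an editor and append the 'gpu' line:
qconf -mc
# Check that the qmaster now knows about the new complex:
qconf -sc | grep gpu
    }}}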
 1. [qmaster] Set the complex values on the GPU host(s):
    {{{
    hostname              tesla.ifca.es
    load_scaling          NONE
    complex_values        gpu=4,mem_free=24G,virtual_free=24G
    user_lists            NONE
    xuser_lists           NONE
    projects              NONE
    xprojects             NONE
    usage_scaling         NONE
    report_variables      NONE
    }}}
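    This is the exec host definition; assuming the stock SGE tooling, it can be inspected and edited as follows (`tesla` being the GPU host from the example above):
    {{{
# Show the current exec host definition:
qconf -se tesla
# Edit it, adding gpu=4 to complex_values:
qconf -me tesla
    }}}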
 1. Load sensor, a script that reports the number of currently available GPUs to the execd:
    {{{
hostname=`uname -n`

# Standard SGE load sensor protocol: block on stdin, emit one report per
# wake-up, and exit cleanly when the execd sends "quit".
while true; do
  read input
  result=$?
  if [ $result != 0 ]; then
    exit 1
  fi
  if [ "$input" == "quit" ]; then
    exit 0
  fi

  # Without nvidia-smi there is no way to count GPUs: report zero available.
  smitool=`which nvidia-smi`
  result=$?
  if [ $result != 0 ]; then
    gpusavail=0
  else
    # Available GPUs = total GPUs minus those with a running compute process.
    gpustotal=`nvidia-smi -L|wc -l`
    gpusused=`nvidia-smi |grep "Process name" -A 6|grep -v +-|grep -v \|=|grep -v Usage|grep -v "No running"|wc -l`
    gpusavail=`echo $gpustotal-$gpusused|bc`
  fi

  # Report in the "host:complex:value" format expected by the execd.
  echo begin
  echo "$hostname:gpu:$gpusavail"
  echo end
done

exit 0
    }}}
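    Before registering it, the sensor can be exercised by hand: run the script, press Enter to trigger one report cycle, and type `quit` to stop it. A sketch of such a session on the host above (the reported value is illustrative, assuming 4 free GPUs):
    {{{
# /nfs4/opt/gridengine/util/resources/loadsensors/gpu.sh
begin
tesla.ifca.es:gpu:4
end
quit
    }}}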
     

 1. [qmaster] Per-host load sensor:
    {{{
    # qconf -sconf tesla
    #tesla.ifca.es:
    load_sensor /nfs4/opt/gridengine/util/resources/loadsensors/gpu.sh
    }}}

    * The script must be available on the execution node (e.g. shared via NFS).
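    The load_sensor parameter shown above lives in the host-specific configuration; assuming the standard SGE tooling, it is added along these lines:
    {{{
# Edit the GPU node's host-specific configuration and add the load_sensor line:
qconf -mconf tesla
    }}}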

 1. [execd] Restart the execd process to load the new sensor, then check that the sensor is running:
    {{{
# ps auxf
(..)
root 24786 0.0 0.0 163252 2268 ? Sl 16:51 0:00 /nfs4/opt/gridengine/bin/lx-amd64/sge_execd
root 24798 0.0 0.0 106104 1260 ? S 16:51 0:00 \_ /bin/sh /nfs4/opt/gridengine/util/resources/loadsensors/gpu.sh
root 24801 0.0 0.0 106104 544 ? S 16:51 0:00 \_ /bin/sh /nfs4/opt/gridengine/util/resources/loadsensors/gpu.sh
root 24802 71.0 0.0 11140 988 ? R 16:51 0:00 \_ nvidia-smi -L
root 24803 0.0 0.0 100924 632 ? S 16:51 0:00 \_ wc -l
(..)
    }}}

    * Soft-stop the service if there are jobs running, so that they are left untouched.
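    A sketch of such a restart, assuming the stock sgeexecd init script (whose softstop argument shuts the daemon down without killing running jobs):
    {{{
# /etc/init.d/sgeexecd softstop
# /etc/init.d/sgeexecd start
    }}}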

 1. [qmaster] Query the `gpu` resource on the GPU host:
   {{{
   # qhost -h tesla -F gpu
   HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
   ----------------------------------------------------------------------------------------------
   global                  -               -    -    -    -     -       -       -       -       -
   tesla                   lx-amd64        4    1    4    4  0.19   23.5G    1.7G   11.8G     0.0
       Host Resource(s):      hl:gpu=4.000000
   }}}
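   With the consumable in place, jobs can request GPUs directly from the batch system; a minimal local test, where `job.sh` stands for any job script:
   {{{
# Request one GPU; SGE decrements the host's gpu consumable while the job runs:
qsub -l gpu=1 job.sh
   }}}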

== Testing (from the UI) ==
{{{
# cat test_cream.jdl
[
  JobType = "Normal";
  Executable = "foo.sh";
  StdOutput="out.out";
  StdError="err.err";
  InputSandbox={"foo.sh"};
  OutputSandbox={"out.out", "err.err" };
  OutputSandboxBaseDestUri="gsiftp://localhost";
  CERequirements="gpu==2";
]
}}}
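Such a JDL would be submitted with the standard CREAM CLI; a sketch, where the endpoint and queue name are placeholders for the site's own values:
{{{
glite-ce-job-submit -a -r <cream-ce-host>:8443/cream-sge-<queue> test_cream.jdl
}}}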

== Sources ==
 1. GridEngine
    * http://serverfault.com/questions/322073/howto-set-up-sge-for-cuda-devices
    * http://gridengine.org/pipermail/users/2012-April/003338.html
 1. NVIDIA CUDA
    * https://devtalk.nvidia.com/default/topic/697308/compute-processes-not-supported/
