Basic Grid Engine Usage

<<TableOfContents>>

General Notes

Grid Engine jobs are submitted to the cluster by means of the qsub, qrsh and qlogin programs:

Additionally, the qalter program allows you to change the attributes of pending jobs.

All of these commands accept a wide range of arguments that control the job behaviour and how the job is scheduled. These options can be specified as follows, from lowest to highest priority:

This means that a command line option will override the same flag embedded in the job script, and that an embedded flag will override the same flag if specified in a request file.

The general algorithm to submit a job should be:

  1. Prepare the files (scripts, data, source code, makefiles, etc.) for submitting the job.
  2. Copy those files to the appropriate directory in /gpfs/csic_projects (see the note at http://grid.ifca.es/wiki/Cluster/Usage#Shared_areas).

  3. Submit the job
  4. Monitor the job with qstat -u <your username>

  5. Get your results.
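
For instance, a minimal session following these steps could look like the sketch below (the project, directory and file names are placeholders):

$ cp -r ~/myjob /gpfs/csic_projects/<your_project>/
$ cd /gpfs/csic_projects/<your_project>/myjob
$ qsub -P <project> myjob.sh
$ qstat -u <your username>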

Normally you don't need to specify all of the available options/directives. The following options must be specified; otherwise, default values will be imposed and your job will either be penalized or fail:

The following options are not mandatory, but they are recommended:

Shell

The default shell for the jobs is csh. You can change it using the -S <shell> option, choosing any shell among sh, ksh, csh, tcsh and bash.
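
For example, to run your job under bash (a sketch; the shell is typically given as its full path):

$ qsub -S /bin/bash <jobfile>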

Working directories and output files

If you specify the -cwd flag, your job will run in the directory from which you submitted it; otherwise, it will run in your home directory. This also affects the output files of the job (the standard output and error), which will be placed in that location.

If you want to specify a different location for the standard output and error files, you can use the -o <path> and -e <path> options respectively. If you want to join both files into only one, you can specify the -j y option.
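
For example, to run in the submission directory and merge standard output and error into a single file (a sketch; the file name is a placeholder):

$ qsub -cwd -j y -o myjob.log <jobfile>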

If your job makes intensive use of the disk, you should instruct it to write to $TMPDIR. This environment variable is set to a temporary local scratch directory that is removed after the job has finished, so an example algorithm for this kind of job is as follows:

  1. Copy the code and files to $TMPDIR
  2. Set up the environment
  3. Run the job
  4. Copy the results to $SGE_O_WORKDIR ($TMPDIR is wiped after the job completes)
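
A minimal job script following this scheme could look like the sketch below (myprog and input.dat are placeholder names):

#!/bin/bash
#$ -S /bin/bash
#$ -cwd

# 1. Copy the code and input files to the local scratch directory
cp myprog input.dat $TMPDIR

# 2. Set up the environment
cd $TMPDIR

# 3. Run the job, writing all intermediate files locally
./myprog input.dat > output.dat

# 4. Copy the results back before $TMPDIR is wiped
cp output.dat $SGE_O_WORKDIR/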

Job submission, projects and queues

In the IFCA cluster you should not submit jobs directly to any queue, since queue selection is left to the scheduler. You must submit your jobs to a project, so that the scheduler can dispatch them to the most suitable queues.

The project selection is made with the -P <project> option.

$ qsub -P <project> <jobfile>

You can get the list of projects you are allowed to use by issuing:

$ /nfs4/usr/bin/rep-sge-prjs.py

Job submission without specifying a project is not allowed

You should contact your supervisor if you are unsure about your project.

Do not specify queues in your submission

Although it is possible to specify a queue in your job submission, it is not recommended to do so. Access to certain queues is possible only if the user and/or project has special privileges, so if you make a hard request for a given queue, your job most likely won't be properly scheduled (it could even starve).

Notifications

You can get email notifications whenever a job changes its state. A valid email address must be specified using the -M option, as well as the kind of notification using the -m option. Valid values are:

Please note that -m can be specified several times, so the following job submission

$ qsub -m b -m e -M user@example.org <jobfile>

will produce an email when the job begins (b) and when it ends (e).

Specifying resources

In order to get your submitted jobs executed quickly (and to benefit from any possible backfilling), you should tune some of the resources that your job requests. The more accurately you describe your job's bounds, the sooner your job will run (the default values are quite high in order to prevent jobs from being killed by the batch system, so they heavily penalize job execution).

Some of these limits are defined in the $SGE_ROOT/default/common/sge_request file. If your application is always expected to use the same values, you can override that file by creating a $HOME/.sge_request file. For further details, please check the sge_request manual page.

Resources are specified with the -l <resource 1>,<resource 2>,...<resource n> switch.
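
For example, several resources can be combined in a single -l request (the values here are placeholders; the individual resources are described below):

$ qsub -l h_rt=12:00:00,mem_free=2G <jobfile>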

1. Wall Clock time

A default wall clock time of 72 hours is enforced for all jobs submitted to the cluster. Should you require a higher or lower value, set it yourself by requesting a new h_rt value in the form hours:minutes:seconds. Please note that requesting a high value may negatively impact your job's scheduling, and that a low value may make your job eligible for backfilling (and thus dispatched earlier). Please try to be as accurate as possible when setting this value. For example, a job requiring 22h, with a couple of extra hours as a safety margin, should be sent as follows:

$ qsub -l h_rt=24:00:00 <jobfile>

This is a hard request, and any job exceeding the requested value will be killed.

2. Memory management

When requesting memory for a job you must take into account that per-job memory in the default queues is limited to a Resident Set Size (h_rss) of 5 GB. If you need more memory, you should request the special resource highmem. Please notice that your group may not be able to request that flag by default; if you need to do so, please open a ticket requesting it. Also notice that these nodes might be overloaded by other users requesting the same flag, so use it wisely.

It is highly recommended that you tune your memory requirements to realistic values and do not rely on the defaults. Special emphasis is placed on the following resources:

2.1. h_rss

This limit refers to the hard resident set size limit. The batch system will make sure a given job does not consume more memory than the value assigned to this variable. This means that any job exceeding the requested h_rss limit will be killed (SIGKILL) by the batch system. It is recommended to request this resource as a top limit for your application: if you expect your job to consume no more than a peak value of 3 GB, you should request those 3 GB as its resident set size limit. This request will not penalize the scheduling of your jobs.

This is a hard request, and any job exceeding the requested value will be killed.
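
For example, for the 3 GB peak mentioned above (a sketch; the job file name is a placeholder):

$ qsub -l h_rss=3G <jobfile>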

2.2. mem_free

This refers to the free RAM necessary for the job to run. The batch system will allow jobs to run only if sufficient memory (as requested by mem_free) is available for them on a given node, and it will subtract that amount from the available resources once the job is running. This ensures that a node with 16 GB of memory will not run jobs totaling more than 16 GB. The default value is 1.8 GB per slot. Please note that exceeding the mem_free limit will not automatically kill your job; its aim is just to try to ensure that your job has the memory you requested available. Also note that this value is not intended to reflect the memory peaks of your job. This request will impact the scheduling of your jobs, so it is highly recommended to tune it to fit your application's actual memory usage.
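
For example, a job expected to use around 4 GB of RAM most of the time might be submitted as follows (a sketch; the value is a placeholder):

$ qsub -l mem_free=4G <jobfile>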

2.3. Memory usage above 5G

For serial jobs requiring more than 5 GB of memory, submission requesting the highmem flag is necessary. When using this flag, the h_rss limit will be unset, but the requirement tuning described above still applies. If your group is allowed to request it and your job needs 20 GB of memory, submit it as follows:
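
# A sketch, assuming highmem is requested as a plain resource flag (as with
# the immediate resource below) and mem_free is tuned to the 20 GB requirement:
$ qsub -l highmem,mem_free=20G <jobfile>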

2.4. Examples
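
A sketch of a job script combining the memory-related requests described above as embedded directives (the values and program name are placeholders):

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
# Hard resident set size limit: the job is killed if it exceeds 3 GB
#$ -l h_rss=3G
# Free RAM that must be available on the node for the job to start
#$ -l mem_free=2G

./myprog input.dat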

3. Infiniband

Infiniband is no longer available in this cluster.

If you are executing MPI parallel jobs you may benefit from the Infiniband interconnection available on the nodes. In order to do so, you must request the special resource infiniband:

Setting up the environment

There are also useful alternatives to export or modify the job environment for execution.
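
For example, using standard Grid Engine options (a sketch; -V exports your whole submission environment, while -v sets individual variables, OMP_NUM_THREADS being just an illustrative choice):

$ qsub -V <jobfile>
$ qsub -v OMP_NUM_THREADS=8 <jobfile>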

Array jobs

It is possible to submit a so-called Array Job, i.e. an array of identical tasks differentiated only by an index number and treated by Grid Engine almost like a series of jobs. The option to submit an array job is -t. The argument of the -t option specifies the number of array job tasks and the index numbers which will be associated with the tasks. The index number is exported to each job task via the environment variable SGE_TASK_ID.
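
For instance, the following sketch submits 10 tasks, each processing its own input file selected by the task index (the script and file names are placeholders):

$ qsub -t 1-10 array_job.sh

where array_job.sh could be:

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
# SGE_TASK_ID takes the values 1..10, one per task
./myprog input.$SGE_TASK_ID.dat > output.$SGE_TASK_ID.dat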

Parallel jobs

Parallel jobs must be submitted to a parallel environment (-pe <pe name> <slots>), specifying the number of slots required. Depending on the PE used, SGE will allocate the slots in a different way.

$ qsub -pe mpi 8 <jobfile>

Please note that parallel jobs will be routed to the parallel queues in which no hard memory limits are set (see previous section).

The following parallel environments are available:

PE Name  Node distribution
smp      All slots have to be in just one node.
mpi      All slots are spread across the available nodes, although SGE will try to pack them into single nodes.
8mpi     Will use 8-slot servers in exclusivity. The number of slots requested must be a multiple of 8.
24mpi    Will use 24-slot servers in exclusivity. The number of slots requested must be a multiple of 24.
48mpi    Will use 48-slot servers in exclusivity. The number of slots requested must be a multiple of 48.
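
For example, a sketch requesting 16 slots on 8-slot servers used in exclusivity (the job file is a placeholder):

$ qsub -pe 8mpi 16 <jobfile>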

The actual command line for the execution of the parallel application depends on the parallel library/framework it uses for communication between the allocated slots. At IFCA, Open MPI v1.4 is configured in the default user environment and is installed at /usr/lib64/openmpi/1.4-gcc. Open MPI includes tight integration with the batch system, therefore executing an application with as many processes as allocated slots does not require any special arguments for mpiexec. The following example executes an 8-process application:

#!/bin/bash
#$ -S /bin/bash
#$ -pe mpi 8

# be sure to include the complete path or
# invoke mpiexec from the correct directory
mpiexec /path/to/your/application

# This is equivalent to: 
# mpiexec -np $NSLOTS /path/to/your/application
# ($NSLOTS is defined by SGE and is the number of allocated slots)

Better control of the processes started can be achieved with several mpiexec/mpirun parameters, like -np, which sets the total number of processes to start, or -npernode, which fixes the number of processes per available node.

Open MPI will try to use the best available communication network at runtime. In order to restrict the communication method you may use the --mca btl parameter of mpiexec. Forcing a communication network may render your application unrunnable; Open MPI automatically selects the best communication method for you. For a list of available communication methods, use the ompi_info command as shown:

$ ompi_info | grep btl
                 MCA btl: ofud (MCA v2.0, API v2.0, Component v1.4)
                 MCA btl: openib (MCA v2.0, API v2.0, Component v1.4)
                 MCA btl: self (MCA v2.0, API v2.0, Component v1.4)
                 MCA btl: sm (MCA v2.0, API v2.0, Component v1.4)
                 MCA btl: tcp (MCA v2.0, API v2.0, Component v1.4)

For example, to avoid the use of the tcp network, you could use the following command line:

# ^tcp means anything but tcp
mpiexec --mca btl ^tcp /path/to/your/application

More examples can be found in the Open MPI FAQ.

mpi-start is also installed on the cluster. It may be useful if you are testing more than one MPI implementation or submitting jobs via grid. Check its user documentation for more information.

Job reservation

It is possible to indicate whether a reservation should be made for a job using the -R y option. When a runnable job cannot be started due to a shortage of resources, a reservation can be scheduled instead. This is especially useful for parallel jobs, or jobs requesting bottleneck resources (for example, a high amount of memory), and it is normally useless for normal sequential jobs. Nodes that are reserved may be eligible for backfilling if there are jobs with a time request smaller than the predicted start time of the reserved job.

$ qsub -R y <jobfile>

Interactive jobs

Interactive, short-lived and high-priority jobs can be sent if your project has permission to do so. This kind of job can only request a maximum of 2 hours of wall clock time; see the previous section for details about limiting the wall clock time of a job.

X11 forwarding is possible when using the qlogin command. Using X11 forwarding requires a valid DISPLAY; use ssh -X or ssh -Y to enable X11 forwarding in your ssh session when logging in to the UI.

$ qlogin -P <project> -l h_rt=1:00:00

Short jobs

A special resource called immediate is available for some users who need fast scheduling for their short-lived batch jobs. This kind of job can only request a maximum of 2 hours of wall clock time.

$ qsub -l immediate <jobfile>

Please note that you might not have access to these resources.

Advanced reservation

Some users and/or projects might request a reservation of a set of resources in advance. This is called an "Advanced Reservation" (AR). If your project needs such a reservation, you should make a petition using the support helpdesk. You need to specify the following:

Once the request has been made, the system administrators will give you the ID(s) of the AR created. You can submit your jobs whenever you want by issuing:

$ qsub -ar <reservation_id> <other_job_options>

You can submit your job(s) before the AR starts and also once it has started. However, you should take care of the duration of the reservation and your job's duration. If your job execution exceeds either the h_rt that it has requested or the duration of the AR, it will be killed by the batch system.

You should also take into account that your reservation might not be created on the date and time that you requested if there are no resources available. In this case, it will be created whenever possible. To avoid this, please request your reservations well in advance.

Since the requested and reserved resources cannot be used for other jobs, they will be accounted for as if they were resources used by normal jobs (even if the AR remains unused). Please request only the resources that you need.

If you want to query the existing advance reservations, you can use the qrstat command. To query a specific advance reservation, you can issue:

$ qrstat -ar <reservation_id>

Examples

There is a set of examples under /nfs4/opt/gridengine_examples/. In order to get familiar with SGE, you should copy that directory to your $HOME directory, then inspect and submit the scripts present there:

$ cp -r /nfs4/opt/gridengine_examples/ .
$ cd gridengine_examples
(have a look at the scripts and modify them if needed)
$ qsub 01_hello_world.sh
$ qsub 02_hello_world_tasks.sh 
$ qsub 03_hello_world_parallel.sh
$ qsub 04_calculate_pi.sh