location: Diff for "Cluster/Usage"
Differences between revisions 6 and 15 (spanning 9 versions)
Revision 6 as of 2010-06-29 15:49:47
Size: 16519
Editor: aloga
Comment:
Revision 15 as of 2010-07-08 11:30:24
Size: 16538
Editor: aloga
Comment:
Line 4: Line 4:
Line 8: Line 7:

'''GRIDUI''' (Grid [[User Interface]]) cluster is the interactive gateway for
the computing resources and projects IFCA is involved on. It is a SSH
[http://www.linuxvirtualserver.org/ LVS], based on [http://linuxsoft.cern.ch/ Scientific Linux Cern]
versions 5. For legacy applications some nodes on Scientific Linux Cern 4 are still available, however the mainteinance of these machines is made in a best-effort basis. The login machines are as follow:
'''GRIDUI''' (Grid [[User Interface]]) cluster is the interactive gateway for the computing resources and projects IFCA is involved on. It is a SSH [[http://www.linuxvirtualserver.org/|LVS]], based on [[http://linuxsoft.cern.ch/|Scientific Linux CERN]] versions 5. For legacy applications some nodes on Scientific Linux CERN 4 are still available, however the maintenance of these machines is made in a best-effort basis. The login machines are as follow:
Line 22: Line 17:

Login on these machines is provided via
[[http://en.wikipedia.org/wiki/Secure_Shell|Secure Shell]]. You can use the standard port 22 or the 22000.

Please note that this cluster is not intended for the execution of CPU
intensive tasks, for this purpose use any of the available computing
resources.
Login on these machines is provided via [[http://en.wikipedia.org/wiki/Secure_Shell|Secure Shell]]. You can use the standard port 22 or the 22000.

Please note that this cluster is not intended for the execution of CPU intensive tasks, for this purpose use any of the available computing resources.
Line 33: Line 24:

Authentication is centralized via secured LDAP. All the changes made to a user
account in one node take immediate effect in the whole cluster. There is also
a secured web interface, that allows a user to change his/her details,
available at https://cerbero.ifca.es/. If you need to reset your account
password, please contact the system administrators.

With this username you should be able to access also the ticketing system at
http://support.ifca.es/
Authentication is centralized via secured LDAP. All the changes made to a user account in one node take immediate effect in the whole cluster. There is also a secured web interface, that allows a user to change his/her details, available at https://cerbero.ifca.es/. If you need to reset your account password, please contact the system administrators.

With this username you should be able to access also the ticketing system at http://support.ifca.es/
Line 44: Line 29:

As the cluster is based on several Grid User Interfaces it allows users to access either
EGEE-III, int.eu.grid, [[http://www.euforia-project.eu/EUFORIA/|EUFORIA]], [[https://web.lip.pt/wiki-IBERGRID/|IBERGRID]] and [[http://grid.csic.es|GRID-CSIC]] infrastructures. To set up the correct
environment variables, please <code>source</code> any of the environment scripts
located under <code>
/gpfs/csic_projects/grid/etc/env/</code>. For example, to
use the I2G infrastructure:

{{{
% source /gpfs/csic_projects/grid/etc/env/i2g-env.sh
}}}
As the cluster is based on several Grid User Interfaces it allows users to access either EGEE-III, int.eu.grid, [[http://www.euforia-project.eu/EUFORIA/|EUFORIA]], [[https://web.lip.pt/wiki-IBERGRID/|IBERGRID]] and [[http://grid.csic.es|GRID-CSIC]] infrastructures. To set up the correct environment variables, please `source` any of the environment scripts located under `/gpfs/csic_projects/grid/etc/env/`. For example, to use the I2G infrastructure:

{{{#!highlight console numbers=disable
% source /gpfs/csic_projects/grid/etc/env/i2g-env.sh }}}
Line 81: Line 59:

More information on seting up the Grid UI is available locally at the
[[DUS: Setting up the User Interface account|corresponding DUS page]].
More information on seting up the Grid UI is available locally at the  [[DUS: Setting up the User Interface account|corresponding DUS page]].
Line 86: Line 62:

PBS cluster was <strong>decommissioned</strong> in December 2009. Please use the new [[#SGE_Cluster|SGE Cluster]] instead.
PBS cluster was '''decommissioned''' in December 2009. Please use the new [[#SGE_Cluster|SGE Cluster]] instead.
Line 90: Line 65:

The SGE Cluster is based on <code>Scientific Linux CERN SLC release
5.5</code> machines, running on x86_64. The exact number or resources
available is shown on the
[[http://monitor.ifca.es/ganglia/?c=SGE%20Worker%20Nodes&m=&r=hour&s=descending&hc=4|monitorization page]].
The SGE Cluster is based on `Scientific Linux CERN SLC release 5.5` machines, running on x86_64. The exact number or resources available is shown on the [[http://monitor.ifca.es/ganglia/?c=SGE%20Worker%20Nodes&m=&r=hour&s=descending&hc=4|monitorization page]].
Line 97: Line 68:

Local submission is allowed for certain projects. As stated below, there are some shared areas that can be accessed from the computing nodes. The underlying batch system is [http://gridengine.sunsource.net/ Sun Grid Engine]. Please note that the syntax for job submission (<tt>qsub</tt>) and monitoring (<tt>qstat</tt>) is similar to the one you might be accustomed to from PBS, but there are important differences. Refer to the following sources for information:

* Grid Engine [[http://gridengine.sunsource.net/documentation.html|documentation page]]
* Some useful [[http://gridengine.sunsource.net/project/gridengine/howto/howto.html|HOWTOS]]
* SGE [[http://gridengine.sunsource.net/project/gridengine/howto/basic_usage.html|basic usage]].

Users should submit their jobs directly to their project by using <code>qstat -P <project></code>. In other words, instead of:

{{{
% qsub -q lhidra ''jobfile''
}}}

you should write (note the dot between the "<tt>l</tt>" and the project name):

{{{
% qsub -P l.hidra ''jobfile''
}}}

If a project name is not specified, the job will fall in the ''catch-all low-priority'' queue. Submission without specifying a project is allowed, in order to perform <strong>short</strong> test jobs. However, it is in the best interest of the user to use her project in general, so that the job can take full advantage of more capable (longer, more CPUs, higher priority) queues.

A scratch area is defined for every job as the environment variable <code>$TMPDIR</code>. This area is cleared after the job has exited.
Local submission is allowed for certain projects. As stated below, there are some shared areas that can be accessed from the computing nodes. The underlying batch system is [[http://gridengine.sunsource.net/%20Sun%20Grid%20Engine|http://gridengine.sunsource.net/%20Sun%20Grid%20Engine]]. Please note that the syntax for job submission (`qsub`) and monitoring (`qstat`) is similar to the one you might be accustomed to from PBS, but there are important differences. Refer to the following sources for information:

* Grid Engine [[http://gridengine.sunsource.net/documentation.html|documentation page]] * Some useful [[http://gridengine.sunsource.net/project/gridengine/howto/howto.html|HOWTOS]] * SGE [[http://gridengine.sunsource.net/project/gridengine/howto/basic_usage.html|basic usage]].

Users should submit their jobs directly to their project by using `qstat -P <project>`. In other words, instead of:

{{{#!highlight console numbers=disable
% qsub -q lhidra <jobfile>}}}
you should write (note the dot between the "`l`" and the project name):

{{{#!highlight console numbers=disable
% qsub -P l.hidra <jobfile>}}}
If a project name is not specified, the job will fall in the ''catch-all low-priority'' queue. Submission without specifying a project is allowed, in order to perform '''short''' test jobs. However, it is in the best interest of the user to use her project in general, so that the job can take full advantage of more capable (longer, more CPUs, higher priority) queues.

A scratch area is defined for every job as the environment variable `$TMPDIR`. This area is cleared after the job has exited.
Line 121: Line 85:

A default wall clock time of 72 hours is enforced by default in all jobs submitted to the cluster. Should you require a higher value, please set it by yourself by requesting a new <code>h_rt</code> value in the form <code>hours:minutes:seconds</code>. Please note that requesting a high value may impact negatively in your job scheduling and execution. Please try to be as accurate as possible when setting this required value. For example, a job requiring 23h should be sent as follows:

{{{
% qsub -P l.hidra -l h_rt=24:00:00 ''jobfile''
}}}
A default wall clock time of 72 hours is enforced by default in all jobs submitted to the cluster. Should you require a higher value, please set it by yourself by requesting a new `h_rt` value in the form `hours:minutes:seconds`. Please note that requesting a high value may impact negatively in your job scheduling and execution. Please try to be as accurate as possible when setting this required value. For example, a job requiring 23h should be sent as follows:

{{{#!highlight console numbers=disable
% qsub -P l.hidra -l h_rt=24:00:00 <jobfile>}}}
Line 129: Line 90:
Line 132: Line 92:
However, it is <strong>highly recommended</strong> that you tune your memory requirements to some realistic values. Special emphasis is made in the following resources:

 * <code>h_rss</code>
 * <code>mem_free</code>

====== h_rss ======

The first one (<code>h_rss</code>) refers to the '''hard resident set size limit'''. The batch system will make sure a given job does not consume more memory than the value assigned to that variable. This means that <strong>any job above the requested <code>h_rss</code> limit will be killed (SIGKILL) by the batch system</strong>. It is recommended to request this resource as a top limit for your application. If you expect your job to consume no more than a peak value of 3GB you should request those 3GB as its resident set size limit. This request will not produce a penalty on the scheduling of your jobs.

====== mem_free ======

The second one (<code>mem_free</code>) refers to the free RAM necessary for the job to run. The batch system will allow jobs to run only if sufficient memory (as requested by <code>mem_free</code>) is available for them. It will also subtract that amount of memory from the available resources, once the job is running. This ensures that a node with 16 GB of memory will not run jobs totalling more than 16 GB. The default value is 1.8 GB per slot. Please note that breaking the <code>mem_free</code> limit will not automatically kill your job. Its aim is just to ensure that your job has available the memory you requested. Also note that this value is not intended to be use to reflect the memory peaks of your job. This request will impact the scheduling of your jobs, so it is highly recommended to tune it to fit your application memory usage.

This limit is defined in the <code>
/opt/gridengine/default/common/sge_request</code> file. If your application is always expected to use the same values, you can override that file by creating a <code>$HOME/.sge_request</code> file. For further details, please check the <code>sge_request</code> manual page.

====== Examples ======
However, it is '''highly recommended''' that you tune your memory requirements to some realistic values. Special emphasis is made in the following resources:

 * `h_rss`
 * `mem_free`

===== h_rss =====
The first one (`h_rss`) refers to the '''hard resident set size limit'''. The batch system will make sure a given job does not consume more memory than the value assigned to that variable. This means that '''any job above the requested `h_rss` limit will be killed (SIGKILL) by the batch system'''. It is recommended to request this resource as a top limit for your application. If you expect your job to consume no more than a peak value of 3GB you should request those 3GB as its resident set size limit. This request will not produce a penalty on the scheduling of your jobs.

===== mem_free =====
The second one (`mem_free`) refers to the free RAM necessary for the job to run. The batch system will allow jobs to run only if sufficient memory (as requested by `mem_free`) is available for them. It will also subtract that amount of memory from the available resources, once the job is running. This ensures that a node with 16 GB of memory will not run jobs totalling more than 16 GB. The default value is 1.8 GB per slot. Please note that breaking the `mem_free` limit will not automatically kill your job. Its aim is just to ensure that your job has available the memory you requested. Also note that this value is not intended to be use to reflect the memory peaks of your job. This request will impact the scheduling of your jobs, so it is highly recommended to tune it to fit your application memory usage.


This limit is defined in the `/opt/gridengine/default/common/sge_request` file. If your application is always expected to use the same values, you can override that file by creating a `$HOME/.sge_request` file. For further details, please check the `sge_request` manual page.

===== Examples =====
Line 151: Line 108:
{{{
% qsub -l mem_free=4G ''jobfile''
}}}
{{{#!highlight console numbers=disable
% qsub -l mem_free=4G <jobfile>}}}
Line 157: Line 113:
{{{
% qsub -l h_rss=4G,mem_free=3G ''jobfile''
}}}
{{{#!highlight console numbers=disable
% qsub -l h_rss=4G,mem_free=3G <jobfile>}}}
Line 163: Line 118:
{{{
% qsub -l h_rss=4G,mem_free=4G ''jobfile''
}}}

For serial jobs requiring more than 5 GB of memory, submission requesting the '''''highmem''''' flag is necessary. Using this flag, the <code>h_rss</code> limit will be unset, but the requirement tuning described above still applies. If your group is allowed to request it, and your job needs 20GB of memory, you can request it as follows:

{{{
% qsub -P ''projectname'' -l highmem,mem_free=20G ''jobfile''
}}}

* A job that might reach 30 GB and needs 20GB will be submitted as:
 
{{{
% qsub -P ''projectname'' -l highmem,h_rss=30G,mem_free=20G ''jobfile''
}}}

* For jobs using [[MPI]], please refer to the [[http://wiki.ifca.es/e-ciencia/index.php/GRIDUI_Cluster#Parallel_jobs|Parallel job submission]] section.

<!--
==== Scratch space ====

The scratch area for the jobs submitted to the cluster is located under <code>/tmp/</code> and is pointed by the <code>$TMPDIR</code> variable.

By default, jobs request a 2GB scratch area. Should you need more space, please use the <code>scract_space</code> in your resource requiremens:

% qsub -l scratch_space=20G

Please note that this space is dynamic. To check the current disk usage in the nodes you can issue:

% qhost -F scratch_space
-->
{{{#!highlight console numbers=disable
% qsub -l h_rss=4G,mem_free=4G <jobfile>}}}

For serial jobs requiring more than 5 GB of memory, submission requesting the '''''highmem''''' flag is necessary. Using this flag, the `h_rss` limit will be unset, but the requirement tuning described above still applies. If your group is allowed to request it, and your job needs 20GB of memory, you can request it as follows:

{{{#!highlight console numbers=disable
% qsub -P <project name> -l highmem,mem_free=20G <jobfile>}}}

 * A job that might reach 30 GB and needs 20GB will be submitted as:

{{{#!highlight console numbers=disable
% qsub -P <projectname> -l highmem,h_rss=30G,mem_free=20G <jobfile>}}}

 * For jobs using [[MPI]], please refer to the [[http://wiki.ifca.es/e-ciencia/index.php/GRIDUI_Cluster#Parallel_jobs|Parallel job submission]] section.

##==== Scratch space ====
##
##
The scratch area for the jobs submitted to the cluster is located under `/tmp/` and is pointed by the `$TMPDIR` variable.
##
##
By default, jobs request a 2GB scratch area. Should you need more space, please use the `scract_space` in your resource requiremens:
##
##
% qsub -l scratch_space=20G
##
##
Please note that this space is dynamic. To check the current disk usage in the nodes you can issue:
##
##
% qhost -F scratch_space
##-->
Line 196: Line 146:

Parallel jobs can be submitted to the parallel environment <code>mpi</code>, specifying the number of slots required. SGE will try to spread them over the available resources.

{{{
% qsub -pe mpi 8 ''jobfile''
}}}

Please note that parallel jobs will be routed to queue ''parallel'' (see [[#Memory management|previous section]]). Also note that access to that queue is restricted to groups having requested it beforehand.
Parallel jobs can be submitted to the parallel environment `mpi`, specifying the number of slots required. SGE will try to spread them over the available resources.

{{{#!highlight console numbers=disable
% qsub -pe mpi 8 <jobfile>}}}

Please note that parallel jobs will be routed to queue ''parallel'' (see [[#Memory_management|previous section]]). Also note that access to that queue is restricted to groups having requested it beforehand.
Line 206: Line 154:

Interactive, short lived and high priority jobs can be send to a special queue <code>interactive.q</code> if the user's project has access to it.

{{{
% qsub -q interactive.q ''jobfile''
}}}
Interactive, short lived and high priority jobs can be send to a special queue `interactive.q` if the user's project has access to it.

{{{#!highlight console numbers=disable
% qsub -q interactive.q <jobfile>}}}
Line 216: Line 162:
Line 219: Line 164:
{{{
% qconf -srqs
}}}

In order to know the current usage of the quotas defined above, the comand <code>qquota</code> must be used:

{{{
% qquota -P ''<project> ''
}}}
{{{#!highlight console numbers=disable
% qconf -srqs}}}

In order to know the current usage of the quotas defined above, the comand `qquota` must be used:

{{{#!highlight console numbers=disable
% qquota -P <project>}}}
Line 230: Line 173:
Line 233: Line 175:
* Start datetime, and end datetime (or duration) of the reservation.
* Duration of your job(s) (i.e. h_rt for the individual jobs).
* Computational resources needed (mem_free, number of slots).

Once the request has been made, the system administrators shall give you the ID(s) of the AR created. You can submit your jobs whenever you want by issuing:

{{{
% qsub -ar ''<reservation_id>'' ''<other_job_options>''
}}}

You can submit your job(s) before the AR starts and also once it is started. However, you should take care of the duration of the reservation and your job' duration. If your job execution exceeds either the <code>h_rt</code> that it has requested or the duration of the AR it will be killed by the batch system.
 * Start datetime, and end datetime (or duration) of the reservation.
 * Duration of your job(s) (i.e. h_rt for the individual jobs).
 * Computational resources needed (mem_free, number of slots).

Once the request has been made, the system administrators will give you the ID(s) of the AR created. You can submit your jobs whenever you want by issuing:

{{{#!highlight console numbers=disable
% qsub -ar <reservation_id> <other_job_options>}}}

You can submit your job(s) before the AR starts and also once it is started. However, you should take care of the duration of the reservation and your job' duration. If your job execution exceeds either the `h_rt` that it has requested or the duration of the AR it will be killed by the batch system.
Line 247: Line 188:
Since the requested and reserved resources cannot be used for other jobs, those requested resources will be used for accounting purposes as if they were resources used by normal jobs (even in the case that the AR is unused). <strong>Please request only the resources that you need</strong>.

If you want to query the existing advance reservations, you can use the <code>qrstat</code> command. To query about an specific advance reservation, you can issue:

{{{
% qrstat -ar ''<reservation_id>''
}}}
Since the requested and reserved resources cannot be used for other jobs, those requested resources will be used for accounting purposes as if they were resources used by normal jobs (even in the case that the AR is unused). '''Please request only the resources that you need'''.

If you want to query the existing advance reservations, you can use the `qrstat` command. To query about an specific advance reservation, you can issue:

{{{#!highlight console numbers=disable
% qrstat -ar ''<reservation_id>''}}}
Line 256: Line 196:

The <tt>home</tt> directories (<code>/home/$USER</code>) are shared between the UIs and the computing nodes. There is a ''projects'' shared area (located at <code>/gpfs/csic_projects/</code>), also accessible from the UI and the computing nodes. If your group does not have this area, please contact the system administrators.
The `$HOME` directories (`/home/$USER`) are shared between the UIs and the computing nodes. There is a ''projects'' shared area (located at `/gpfs/csic_projects/`), also accessible from the UI and the computing nodes. If your group does not have this area, please open an [[http://support.ifca.es|Incidence ticket]].
Line 260: Line 199:

The shared directories <strong>are not intended for scratch</strong>, use the temporal areas of the local filesystems instead. In other words, instruct every job you send to copy the input from the shared directory to the local scratch (<tt>$TMPDIR</tt>), execute all operations there, then copy the output back to some shared area where you will be able to retrieve it comfortably from the UI.

As mentioned above, the contents of <tt>$TMPDIR</tt> are removed after job execution.
The shared directories '''are not intended for scratch''', use the temporal areas of the local filesystems instead. In other words, instruct every job you send to copy the input from the shared directory to the local scratch (`$TMPDIR`), execute all operations there, then copy the output back to some shared area where you will be able to retrieve it comfortably from the UI.

As mentioned above, the contents of `$TMPDIR` are removed after job execution.
Line 266: Line 204:

Disk quotas are enabled on both user and projects filesystems. A message with this information should be shown upon login. If you need more quota on your
user space (not in the project shared area), please contact the system administrators explaining your reasons.

If you wish to check your quota at a later time, you can use the commands <code>mmlsquota gpfs_csic</code> (for user quotas) and <code>mmlsquota -g `id -g` gpfs_projects</code> (for group quotas). A script reporting both quotas is located on <code>/gpfs/csic_projects/utils/bin/rep-user-quotas.py</code>. A sample output of the latter could be:
Disk quotas are enabled on both user and projects filesystems. A message with this information should be shown upon login. If you need more quota on your user space (not in the project shared area), please contact the system administrators explaining your reasons.

If you wish to check your quota at a later time, you can use the commands `mmlsquota gpfs_csic` (for user quotas) and `mmlsquota -g `id -g` gpfs_projects` (for group quotas). A script reporting both quotas is located on `/gpfs/csic_projects/utils/bin/rep-user-quotas.py`. A sample output of the latter could be:
Line 274: Line 210:
                    INFORMATION ABOUT YOUR CURRENT DISK USAGE                      INFORMATION ABOUT YOUR CURRENT DISK USAGE
Line 283: Line 219:
********************************************************************** ***
*******************************************************************
Line 286: Line 223:
For a basic interpretation of this output, note that the "Used" column will tell you about how much disk space you are using, whereas "Soft" denotes the limit this "Used" amount should not exceed. The "Hard" column is the value of the limit "Used" plus "Doubt" should not cross. A healthy disk space management would require that you periodically delete unused files in your <tt>$HOME</tt> directory, keeping its usage below the limits at all times. In the event that the user exceeds a limit, a grace period will be shown in the "Grace" column. If the user does not correct the situation within the grace period, she will be banned from writing to the disk.

For further information you can read the [http://www.nersc.gov/vendor_docs/ibm/gpfs/am3admst119.html mmlsquota command manual page].
For a basic interpretation of this output, note that the "Used" column will tell you about how much disk space you are using, whereas "Soft" denotes the limit this "Used" amount should not exceed. The "Hard" column is the value of the limit "Used" plus "Doubt" should not cross. A healthy disk space management would require that you periodically delete unused files in your `$HOME` directory, keeping its usage below the limits at all times. In the event that the user exceeds a limit, a grace period will be shown in the "Grace" column. If the user does not correct the situation within the grace period, she will be banned from writing to the disk.

For further information you can read the [[http://www.nersc.gov/vendor_docs/ibm/gpfs/am3admst119.html|mmlsquota command manual page]].
Line 291: Line 228:

Some extra packages as [[http://python.org|Python 2.6]] and [[http://software.intel.com/en-us/articles/non-commercial-software-development/|Intel Non-Commercial Compilers]] can be found on <code>/gpfs/csic_projects/utils/</code>.
Some extra packages as [[http://python.org|Python 2.6]] and [[http://software.intel.com/en-us/articles/non-commercial-software-development/|Intel Non-Commercial Compilers]] can be found on `/gpfs/csic_projects/utils/`.
Line 297: Line 233:

IFCA Datacenter usage guidelines

Introduction

The GRIDUI (Grid User Interface) cluster is the interactive gateway to the computing resources and projects IFCA is involved in. It is an SSH LVS (Linux Virtual Server) running Scientific Linux CERN 5. For legacy applications some nodes running Scientific Linux CERN 4 are still available; however, these machines are maintained on a best-effort basis. The login machines are as follows:

||Hostname||Distribution||Architecture||
||gridui.ifca.es||<|2>Scientific Linux CERN SLC release 5.5 (Boron)||<|2>x86_64||
||griduisl5.ifca.es||
||griduisl4.ifca.es||Scientific Linux CERN SLC release 4.7 (Beryllium)||i386||

Login on these machines is provided via Secure Shell. You can use either the standard port 22 or port 22000.
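As a small sketch of the two login options, the helper below only prints the `ssh` command line for a given user, host and port; it does not open a connection. The function name `gridui_ssh_cmd` and the user name `jdoe` are illustrative assumptions, not something provided by the cluster.

```shell
# Print (do not run) the ssh command for a GRIDUI login node.
# gridui_ssh_cmd is a hypothetical helper; host names and ports come from this page.
gridui_ssh_cmd() {
    user=$1
    host=${2:-gridui.ifca.es}   # default login alias
    port=${3:-22}               # standard SSH port unless stated otherwise
    printf 'ssh -p %s %s@%s\n' "$port" "$user" "$host"
}

gridui_ssh_cmd jdoe                          # standard port 22
gridui_ssh_cmd jdoe griduisl4.ifca.es 22000  # alternative port, legacy SLC 4 node
```

Port 22000 can be handy when port 22 is filtered on the network you are connecting from.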

Please note that this cluster is not intended for running CPU-intensive tasks; for that purpose, use any of the available computing resources.

Outgoing SSH connections are not allowed by default from this cluster. Inactive SSH sessions will be closed after 24h.

Authentication

Authentication is centralized via secured LDAP. Any change made to a user account on one node takes immediate effect across the whole cluster. There is also a secured web interface, available at https://cerbero.ifca.es/, that allows users to change their details. If you need to reset your account password, please contact the system administrators.

With this username you should also be able to access the ticketing system at http://support.ifca.es/

Grid resources

As the cluster is based on several Grid User Interfaces, it allows users to access the EGEE-III, int.eu.grid, EUFORIA, IBERGRID and GRID-CSIC infrastructures. To set up the correct environment variables, source one of the environment scripts located under /gpfs/csic_projects/grid/etc/env/. For example, to use the I2G infrastructure:

% source /gpfs/csic_projects/grid/etc/env/i2g-env.sh 

The available environments are:

||Filename||Allows access to||
||euforia-env.{csh,sh}||EUFORIA||
||i2g-env.{csh,sh}||int.eu.grid||
||ibergrid-env.{csh,sh}||IBERGRID and EGI||
||ngi-env.{csh,sh}||EGI and IBERGRID NGI||
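The mapping from infrastructure to environment script can be sketched as a small lookup. The function name `grid_env_script` is a hypothetical helper; the directory and file names are the ones listed on this page.

```shell
# Hypothetical helper: print the environment script path for an infrastructure.
# Directory and file names are taken from this page; nothing is sourced here.
grid_env_script() {
    infra=$1
    flavour=${2:-sh}   # "sh" for Bourne-style shells, "csh" for csh/tcsh
    dir=/gpfs/csic_projects/grid/etc/env
    case $infra in
        euforia|i2g|ibergrid|ngi)
            printf '%s/%s-env.%s\n' "$dir" "$infra" "$flavour" ;;
        *)
            echo "unknown infrastructure: $infra" >&2
            return 1 ;;
    esac
}

grid_env_script i2g       # script for Bourne-style shells
grid_env_script ngi csh   # script for csh/tcsh
```

On the cluster, a Bourne-style shell could then load the environment with `. "$(grid_env_script i2g)"`.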

More information on setting up the Grid UI is available locally at the corresponding DUS page.

PBS Cluster

The PBS cluster was decommissioned in December 2009. Please use the new SGE Cluster instead.

SGE Cluster

The SGE Cluster is based on Scientific Linux CERN SLC release 5.5 machines, running on x86_64. The exact number of resources available is shown on the monitoring page.

Job submission

Local submission is allowed for certain projects. As stated below, there are some shared areas that can be accessed from the computing nodes. The underlying batch system is Sun Grid Engine (http://gridengine.sunsource.net/). Please note that the syntax for job submission (qsub) and monitoring (qstat) is similar to the one you might be accustomed to from PBS, but there are important differences. Refer to the following sources for information:

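  • Grid Engine documentation page (http://gridengine.sunsource.net/documentation.html)

  • Some useful HOWTOS (http://gridengine.sunsource.net/project/gridengine/howto/howto.html)

  • SGE basic usage (http://gridengine.sunsource.net/project/gridengine/howto/basic_usage.html)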

Users should submit their jobs directly to their project by using qsub -P <project>. In other words, instead of:

% qsub -q lhidra <jobfile>

you should write (note the dot between the "l" and the project name):

% qsub -P l.hidra <jobfile>

If a project name is not specified, the job will fall into the catch-all low-priority queue. Submission without a project is allowed so that short test jobs can be run. However, it is in your best interest to specify your project in general, so that the job can take full advantage of more capable queues (longer run times, more CPUs, higher priority).

A scratch area is defined for every job as the environment variable $TMPDIR. This area is cleared after the job has exited.
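A job can use this scratch area by staging its input into `$TMPDIR`, working there, and copying the results out before it exits. The sketch below illustrates that pattern under stated assumptions: the function name `run_in_scratch` and the file names are illustrative, the computation is a stand-in (a line count), and a temporary directory is created when `TMPDIR` is unset, as may happen outside the batch system.

```shell
# Sketch of a job body that does its work inside the per-job scratch area.
# run_in_scratch and the file names are hypothetical; the line count is a
# placeholder for the real computation.
run_in_scratch() {
    input=$1
    output=$2
    scratch=${TMPDIR:-$(mktemp -d)}   # fall back when not run by the batch system
    # Stage the input into the local scratch area.
    cp "$input" "$scratch/input.dat" || return 1
    # Do all the work in scratch (placeholder: count the input lines).
    ( cd "$scratch" && wc -l < input.dat | tr -d ' ' > result.txt ) || return 1
    # Copy the result out before the scratch area is cleared.
    cp "$scratch/result.txt" "$output"
}
```

In a real job file the function would be called with input and output paths in an area that outlives the job, since everything left in `$TMPDIR` is removed when the job exits.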

Wall Clock time

A wall clock time limit of 72 hours is enforced by default on all jobs submitted to the cluster. Should you require a different value, set it yourself by requesting a new h_rt value in the form hours:minutes:seconds. Note that requesting a high value may negatively affect the scheduling and execution of your job, so try to be as accurate as possible when setting it. For example, a job requiring 23h should be sent as follows:

% qsub -P l.hidra -l h_rt=24:00:00 <jobfile>
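The h_rt value can be derived from a run-time estimate. As a minimal sketch, the hypothetical helper `h_rt_from_minutes` below converts minutes into the hours:minutes:seconds form and prints the resulting qsub command without running it.

```shell
# Hypothetical helper: convert a run-time estimate in minutes into an
# h_rt value of the form hours:minutes:seconds.
h_rt_from_minutes() {
    minutes=$1
    printf '%d:%02d:00\n' $((minutes / 60)) $((minutes % 60))
}

# A job expected to take about 23 h, padded to 24 h (1440 minutes).
echo "qsub -P l.hidra -l h_rt=$(h_rt_from_minutes 1440) <jobfile>"
```

Padding the estimate slightly keeps the job from being killed at the limit, while staying well below the 72 h default helps scheduling.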

Memory management

When requesting memory for a job, take into account that per-job memory is limited in the default queues to a [[http://en.wikipedia.org/wiki/Resident_set_size|Resident Set Size]] (h_rss) of 5 GB. If you need more memory, request the special resource highmem. Please note that your group may not be allowed to request that flag by default; if you need it, please [[http://wiki.ifca.es/e-ciencia/index.php/GRIDUI_Cluster#Support|open a ticket]] requesting it.

However, it is highly recommended that you tune your memory requirements to realistic values. Pay special attention to the following resources:

  • h_rss

  • mem_free

h_rss

The first one (h_rss) refers to the hard resident set size limit. The batch system will make sure a given job does not consume more memory than the value assigned to that variable. This means that any job above the requested h_rss limit will be killed (SIGKILL) by the batch system. It is recommended to request this resource as a top limit for your application. If you expect your job to consume no more than a peak value of 3GB you should request those 3GB as its resident set size limit. This request will not produce a penalty on the scheduling of your jobs.

mem_free

The second one (mem_free) refers to the free RAM necessary for the job to run. The batch system will allow jobs to run only if sufficient memory (as requested by mem_free) is available for them. It will also subtract that amount of memory from the available resources once the job is running. This ensures that a node with 16 GB of memory will not run jobs totalling more than 16 GB. The default value is 1.8 GB per slot. Please note that exceeding the mem_free value will not automatically kill your job; its aim is just to ensure that the memory you requested is available to your job. Also note that this value is not intended to be used to reflect the memory peaks of your job. This request will affect the scheduling of your jobs, so it is highly recommended to tune it to fit your application's memory usage.

The default is defined in the /opt/gridengine/default/common/sge_request file. If your application is always expected to use the same values, you can override the settings in that file by creating a $HOME/.sge_request file. For further details, please check the sge_request manual page.
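
As a minimal sketch, a personal request file could be created as shown below. The memory values are purely illustrative and are not site recommendations. Note that this command overwrites any existing $HOME/.sge_request file.

```shell
# Write illustrative default requests to the per-user request file.
# Each non-comment line holds qsub options, as on the command line.
cat > "$HOME/.sge_request" <<'EOF'
# Default memory requests for all my jobs (example values only)
-l h_rss=4G,mem_free=3G
EOF
```

Options given explicitly on the qsub command line still take precedence over the values in this file.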

Examples
  • A job that needs to have 4 GB of memory assigned to it:

% qsub -l mem_free=4G <jobfile>
  • A job that might peak at 4 GB, but in its execution normally needs 3 GB:

% qsub -l h_rss=4G,mem_free=3G <jobfile>
  • A job that might reach 4 GB, and also needs 4 GB:

% qsub -l h_rss=4G,mem_free=4G <jobfile>

For serial jobs requiring more than 5 GB of memory, you must request the highmem flag at submission. With this flag the h_rss limit is unset, but the requirement tuning described above still applies. If your group is allowed to request it and your job needs 20 GB of memory, you can request it as follows:

% qsub -P <projectname> -l highmem,mem_free=20G <jobfile>
  • A job that might reach 30 GB and needs 20 GB will be submitted as:

% qsub -P <projectname> -l highmem,h_rss=30G,mem_free=20G <jobfile>
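
The request patterns in the examples above can be captured in a small shell helper. This is only a sketch: mem_request is a hypothetical function name, and it assumes whole-GB values and the 5 GB default limit described in this section.

```shell
# Hypothetical helper: print the -l resource request for a serial job.
# $1 = expected peak memory in GB (integer)
# $2 = memory normally needed in GB (integer)
mem_request() {
    peak_gb=$1
    typical_gb=$2
    request="h_rss=${peak_gb}G,mem_free=${typical_gb}G"
    # Peaks above the 5 GB default-queue limit need the highmem flag.
    if [ "$peak_gb" -gt 5 ]; then
        request="highmem,$request"
    fi
    printf '%s\n' "$request"
}

mem_request 4 3     # prints h_rss=4G,mem_free=3G
mem_request 30 20   # prints highmem,h_rss=30G,mem_free=20G
```

The output can then be passed to qsub, for instance as qsub -l "$(mem_request 4 3)" <jobfile>. Remember that highmem requests may also need the -P option, as shown above.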

Parallel jobs

Parallel jobs can be submitted to the parallel environment mpi, specifying the number of slots required. SGE will try to spread them over the available resources.

% qsub -pe mpi 8 <jobfile>

Please note that parallel jobs will be routed to the parallel queue (see previous section). Also note that access to that queue is restricted to groups that have requested it beforehand.

Interactive jobs

Interactive, short-lived and high-priority jobs can be sent to a special queue, interactive.q, if the user's project has access to it.

% qsub -q interactive.q <jobfile>

Execution on this queue is limited to a maximum of 1 hour of wall-clock time.

Resource quotas

Some limits may be enforced by the administrators on a per-user, per-group or per-project basis. To check the current resource quotas, run the following command:

% qconf -srqs

To see the current usage of the quotas defined above, use the qquota command:

% qquota -P <project>

Advance reservation

Some users and/or projects might request a reservation of a set of resources in advance. This is called an "Advance Reservation" (AR). If your project needs such a reservation, you should submit a request through the support helpdesk. You need to specify the following:

  • Start date and time, and end date and time (or duration) of the reservation.
  • Duration of your job(s) (i.e. h_rt for the individual jobs).
  • Computational resources needed (mem_free, number of slots).

Once the request has been made, the system administrators will give you the ID(s) of the AR created. You can submit your jobs whenever you want by issuing:

% qsub -ar <reservation_id> <other_job_options>

You can submit your job(s) before the AR starts and also once it has started. However, you should keep in mind both the duration of the reservation and your job's duration. If your job execution exceeds either the h_rt that it has requested or the duration of the AR, it will be killed by the batch system.

You should also take into account that your reservation might not be created at the date and time that you requested if there are no resources available. In this case, it will be created as soon as possible. To avoid this, please request your reservations well in advance.
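
The duration check described above is simple arithmetic. As a sketch, assuming both the job's h_rt and the reservation length are written as hh:mm:ss, it could be done before submitting with two hypothetical helpers:

```shell
# Hypothetical helper: convert hh:mm:ss to seconds.
hms_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Hypothetical helper: succeed if a job of duration $1 fits in a
# reservation of duration $2 (both given as hh:mm:ss).
fits_in_reservation() {
    job=$(hms_to_seconds "$1")
    window=$(hms_to_seconds "$2")
    [ "$job" -le "$window" ]
}

fits_in_reservation 02:00:00 08:00:00 && echo "job fits"
```

This only compares total durations. A job that starts late within the reservation has less time left than the full window.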

Since the requested and reserved resources cannot be used for other jobs, those requested resources will be used for accounting purposes as if they were resources used by normal jobs (even in the case that the AR is unused). Please request only the resources that you need.

If you want to query the existing advance reservations, you can use the qrstat command. To query a specific advance reservation, you can issue:

% qrstat -ar <reservation_id>

Shared areas

The $HOME directories (/home/$USER) are shared between the UIs and the computing nodes. There is a projects shared area (located at /gpfs/csic_projects/), also accessible from the UI and the computing nodes. If your group does not have this area, please open an Incidence ticket.

Usage

The shared directories are not intended for scratch use; use the temporary areas of the local filesystems instead. In other words, instruct every job you submit to copy its input from the shared directory to the local scratch area ($TMPDIR), execute all operations there, and then copy the output back to a shared area from which you can easily retrieve it on the UI.

As mentioned above, the contents of $TMPDIR are removed after job execution.
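
The staging pattern described above could be sketched as the shell function below. The function name run_in_scratch and the file names input.dat and output.dat are hypothetical. The fallback to a temporary directory is only there so that the function can also be tried outside the batch system.

```shell
# Hypothetical staging wrapper: copy input to scratch, run there,
# then copy the result back to the shared area.
# $1 = shared directory holding input.dat; remaining args = command to run.
run_in_scratch() {
    shared_dir=$1
    shift
    # Inside a job $TMPDIR points to the local scratch area;
    # otherwise fall back to a fresh temporary directory.
    scratch=${TMPDIR:-$(mktemp -d)}
    cp "$shared_dir/input.dat" "$scratch/" || return 1
    # Run the command from within the scratch directory.
    ( cd "$scratch" && "$@" ) || return 1
    # Copy the output back before the scratch contents are removed.
    cp "$scratch/output.dat" "$shared_dir/"
}
```

Inside a job script it could be called as run_in_scratch <shared_dir> <command>, where the command reads input.dat and writes output.dat in its current directory.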

Disk quotas

Disk quotas are enabled on both the user and project filesystems. A message with this information should be shown upon login. If you need more quota on your user space (not in the project shared area), please contact the system administrators explaining your reasons.

If you wish to check your quota at a later time, you can use the commands mmlsquota gpfs_csic (for user quotas) and mmlsquota -g $(id -g) gpfs_projects (for group quotas). A script reporting both quotas is located at /gpfs/csic_projects/utils/bin/rep-user-quotas.py. A sample output of the latter could be:

**********************************************************************
                    INFORMATION ABOUT YOUR CURRENT DISK USAGE

USER                Used      Soft      Hard     Doubt     Grace
Space (GB):         3.41     20.00      0.00      0.06      none
Files (x1000):        64         0         0         0      none

GROUP               Used      Soft      Hard     Doubt     Grace
Space (GB):         0.00   1000.00   1500.00      0.00      none
Files (x1000):         0         0         0         0      none
**********************************************************************

For a basic interpretation of this output, note that the "Used" column tells you how much disk space you are using, whereas "Soft" denotes the limit that "Used" should not exceed. The "Hard" column is the limit that "Used" plus "Doubt" must not cross. Healthy disk space management requires that you periodically delete unused files in your $HOME directory, keeping its usage below the limits at all times. If you exceed a limit, a grace period will be shown in the "Grace" column. If you do not correct the situation within the grace period, you will no longer be able to write to the disk.
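
The interpretation above can be expressed as a short check. This is a sketch only: quota_status is a hypothetical helper, and it assumes that a limit of 0 means no limit is set, which the sample output suggests but this page does not state.

```shell
# Hypothetical helper: classify disk usage from the report columns (in GB).
# $1 = Used, $2 = Soft, $3 = Hard, $4 = Doubt
# Assumption: a limit of 0 means that no limit is set.
quota_status() {
    awk -v used="$1" -v soft="$2" -v hard="$3" -v doubt="$4" 'BEGIN {
        if (hard > 0 && used + doubt > hard)
            print "over hard limit"
        else if (soft > 0 && used > soft)
            print "over soft limit"
        else
            print "ok"
    }'
}

# USER "Space" row from the sample output above
quota_status 3.41 20.00 0.00 0.06   # prints ok
```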

For further information you can read the mmlsquota command manual page.

Extra utils

Some extra packages, such as Python 2.6 and the Intel Non-Commercial Compilers, can be found in /gpfs/csic_projects/utils/.

Please note that these packages are provided as-is, without further support from IFCA staff.

Support

Questions, support requests and/or feedback should be directed to the Helpdesk.



eciencia: Cluster/Usage (last edited 2017-02-17 08:58:31 by aloga)