location: Diff for "Supercomputing/Userguide"
Differences between revisions 13 and 14
Revision 13 as of 2012-07-23 11:42:32
Size: 19425
Editor: cabellos
Comment:
Revision 14 as of 2012-07-23 11:44:41
Size: 17139
Editor: cabellos
Comment:
== Running Jobs ==
SLURM is the utility used at Altamira for batch processing support, so all jobs must be run through it. This document provides information for getting started with job execution at Altamira.

In order to keep the load on the login nodes at a reasonable level, processes running interactively on these nodes are limited to 10 minutes of CPU time. Any execution that needs more than this limit should be carried out through the queue system.

=== Submitting Jobs ===
A job is the execution unit for SLURM. A job is defined by a text file containing a set of directives describing the job, and the commands to execute.

These are the basic commands for managing jobs:
 * `mnsubmit <job_script>` submits a ''job script'' to the queue system (see below for job script directives).
 * `mnq` shows all the jobs submitted.
 * `mncancel <job_id>` removes your job from the queue system, canceling the execution of the job if it was already running.

=== Job directives ===
A job must contain a series of directives to inform the batch system about the characteristics of the job. These directives appear as comments in the job script, with the following syntax:
{{{
# @ directive = value
}}}
Additionally, the job script may contain a set of commands to execute. If not, an external script must be provided with the 'executable' directive. These are the most common directives:
{{{# @ class = class_name}}}
The queue where the job is to be submitted. Leave this field empty unless you need to use the "interactive" or "debug" queues.

{{{# @ wall_clock_limit = HH:MM:SS}}}
The wall clock time limit. This field is mandatory: set it to a value greater than the real execution time of your application and smaller than the time limits granted to the user. Note that your job will be killed once this period has elapsed.

{{{# @ initialdir = pathname}}}
The working directory of your job (i.e. where the job will run). If not specified, it is the current working directory at the time the job was submitted.

{{{# @ error = file}}}
The name of the file to collect the standard error (stderr) output of the job.

{{{# @ output = file}}}
The name of the file to collect the standard output (stdout) of the job.

{{{# @ total_tasks = number}}}
The number of processes to start.

{{{# @ cpus_per_task = number}}}
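As a minimal sketch of how the directives above fit together, the snippet below writes a small job script to a file. The file name `hello_job.sh`, the output and error file names and the five-minute limit are illustrative assumptions, not values prescribed by Altamira; only the directive names and the `mnsubmit` command come from this guide.

```shell
# Write a small job script. The quoted 'EOF' keeps $(hostname) unexpanded
# until the job itself runs.
cat > hello_job.sh <<'EOF'
#!/bin/bash
# @ wall_clock_limit = 00:05:00
# @ output = hello.out
# @ error = hello.err
# @ total_tasks = 1

# Commands to execute follow the directives.
echo "Hello from $(hostname)"
EOF

# On the login node the script would then be submitted with:
#   mnsubmit hello_job.sh
```

Because the directives are ordinary shell comments, the same file can also be run directly with `sh hello_job.sh` for a quick check before it is submitted.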
<<Include (/running)>>

Altamira User's Guide

Introduction

This user's guide for the Altamira supercomputer is intended to provide the minimum amount of information needed by a new user of this system. As such, it assumes that the user is familiar with many of the standard aspects of supercomputing, such as the Unix operating system.

We hope you can find most of the information you need to use our computing resources, from applications and libraries to technical documentation about Altamira. Please read this document carefully, and if you have any doubts do not hesitate to contact our support group at <res_support@ifca.unican.es>.

System Overview

Altamira comprises 156 compute nodes, 5 GPU compute nodes, a login server and several service servers. Every compute node has two processors at 2.6 GHz, 64 GB of RAM and 400 GB of local disk storage, and runs Scientific Linux 6.2. Together, the servers provide a total of 132 TB of disk storage accessible from every compute node through GPFS (General Parallel File System).

The networks that interconnect Altamira are:

  • Infiniband Network: High-bandwidth network used for communication between the processes of parallel applications.
  • Gigabit Network: Ethernet network used by the management server.

Connecting to Altamira

Once you have a login and its associated password, you can get into the Altamira system by connecting to the login node altamira1.ifca.es.

You must use Secure Shell (ssh) tools to log in to Altamira or to transfer files to it. We do not accept incoming connections over protocols such as telnet, ftp, rlogin, rcp or rsh. For security reasons, once you are logged into Altamira you cannot make outgoing connections.

To get more information about the supported Secure Shell version and how to get ssh for your system (including Windows systems), see Appendix A.

Here is an example of logging into Altamira from a UNIX environment:

user@localsystem:~$ ssh -l usertest altamira1.ifca.es
usertest@altamira1.ifca.es's password: 

/--------------------------------------------------------------\
|                      Welcome to Altamira                     |
|                                                              |
|  - Applications are located at /gpfs/res_apps                |
|  - For futher information read User's Guide at               |
|      http://grid.ifca.es/wiki/Ojancano/Userguide             |
|                                                              |
|   Please contact res_support@ifca.unican.es for questions    |
|                                                              |
\--------------------------------------------------------------/

[usertest@login1 ~]$ 

The first time that you connect to the Altamira system, secure shell needs to exchange some initial information to establish the communication. This consists of accepting the RSA key of the remote host: you must answer 'yes' or 'no' to confirm the acceptance of this key.

Please change your initial password after you log in to the machine for the first time. Also use a strong password (do not use a word or phrase from a dictionary and do not use a word that can be obviously tied to your person). Finally, please make a habit of changing your password on a regular basis.

If you cannot get access to the system after following this procedure, first consult Appendix A for more detailed information about Secure Shell, or contact us (see Getting Help for how to contact us).

Login Node

Once inside the machine you will be presented with a UNIX shell prompt and you will normally be in your home ($HOME) directory. If you are new to UNIX, you will have to learn the basics before you can do anything useful.

The machine you log into is the login node of Altamira (login1). This machine acts as a front end and is typically used for editing, compiling, preparing and submitting batch executions, and as a gateway for copying data into or out of Altamira.

Running CPU-bound programs on this node is not permitted. If a compilation needs more CPU time than is permitted, it must be done through the batch queue system. It is not possible to connect directly to the compute nodes from the login node; all resource allocation is done by the batch queue system.

Transferring Files

As stated before, no connections are allowed from inside Altamira to the outside world, so all scp and sftp commands have to be executed from your local machine and not from inside Altamira.

Here are some examples of using each of these tools to transfer files to Altamira:

localsystem$ scp localfile usertest@altamira1.ifca.es:
usertest@altamira1.ifca.es's password:

localsystem$ sftp usertest@altamira1.ifca.es
usertest@altamira1.ifca.es's password:
sftp> put localfile
sftp> exit

These are the ways to retrieve files from Altamira to your local machine:

localsystem$ scp usertest@altamira1.ifca.es:remotefile localdir
usertest@altamira1.ifca.es's password:

localsystem$ sftp usertest@altamira1.ifca.es
usertest@altamira1.ifca.es's password:
sftp> get remotefile
sftp> exit
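
For whole directories, `scp -r` copies recursively. The helper below is a hypothetical convenience wrapper, not something provided by Altamira: the function name, the `DRY_RUN` switch and the `usertest` login are assumptions for illustration. It must be run on your local machine, since all transfers have to be started from outside Altamira.

```shell
# Hypothetical settings: replace usertest with your own login.
ALTAMIRA_HOST="altamira1.ifca.es"
ALTAMIRA_USER="usertest"

# Copy a local directory tree to the remote home directory.
# With DRY_RUN=1 the command is printed instead of executed.
altamira_put_dir() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo scp -r "$1" "${ALTAMIRA_USER}@${ALTAMIRA_HOST}:"
    else
        scp -r "$1" "${ALTAMIRA_USER}@${ALTAMIRA_HOST}:"
    fi
}
```

Setting `DRY_RUN=1` before calling `altamira_put_dir results` shows the exact command that would be run, which is a cheap way to check the destination before any data moves.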

On a Windows system, most secure shell clients come with a tool for making secure copies or secure ftp transfers. Several tools meet these requirements; please refer to Appendix A, where you will find the most common ones and examples of their use.

File Systems

IMPORTANT: It is your responsibility as a user of the Altamira system to back up all your critical data.

Each user has several areas of disk space for storing files. These areas may have size or time limits; please read this whole section carefully to learn the usage policy of each of these filesystems. There are 3 different types of storage available inside a node:

  • Root Filesystem: Is the filesystem where the operating system resides
  • GPFS Filesystems: GPFS is a distributed networked filesystem which can be accessed from all the nodes
  • Local Hard Drive: Every compute node has an internal hard drive

Root Filesystem

The root filesystem is where the operating system is installed on each compute node. Using /tmp for temporary user data is NOT permitted. The local hard drive can be used for this purpose, as described in the section about the Local Hard Drive.

Furthermore, the environment variable $TMPDIR is already configured to force the normal applications to use the local hard drive to store their temporary files.

GPFS Filesystems

The IBM General Parallel File System (GPFS) is a high-performance shared-disk file system that can provide fast, reliable data access from all blades of the cluster to a global filesystem. GPFS allows parallel applications simultaneous access to a set of files (even a single file) from any node that has the GPFS file system mounted, while providing a high level of control over all file system operations. These filesystems are recommended for most jobs, because GPFS provides high-performance I/O by "striping" blocks of data from individual files across multiple disks on multiple storage devices and reading or writing these blocks in parallel. In addition, GPFS can read or write large blocks of data in a single I/O operation, thereby minimizing overhead.

These are the GPFS filesystems available in Altamira from all nodes:

  • /gpfs/res_home: Soft link to GPFS folder. This filesystem holds the home directories of all the users; when you log into Altamira you start in your home directory by default. Every user has their own home directory to store executables, their own developed sources and their personal data. Quotas limit the amount of data that can be saved here; a default quota is enforced for all users.

If you need more disk space in this filesystem or in any other GPFS filesystem, the person responsible for your project has to request the extra space, specifying the amount requested and the reasons why it is needed. The request can be sent by email or any other means of contact to the user support team, as explained in the Getting Help section.

  • /gpfs/res_projects: In addition to the home directory, there is a directory in /gpfs/res_projects for each group of users of Altamira. For instance, the group bsc01 will have a /gpfs/res_projects/bsc01 directory ready to use. This space is intended to store data that needs to be shared between the users of the same group or project. A quota per group will be enforced depending on the space assigned by the Access Committee.

All the users of the same project share their common /gpfs/res_projects space, and it is the responsibility of each project manager to determine and coordinate the best use of this space and how it is distributed or shared between the project's users. If a project needs more disk space in this filesystem or in any other GPFS filesystem, the project manager has to request the extra space, specifying the amount needed and the reasons why it is needed. The request can be sent by email or any other means of contact to the user support team, as explained in the Getting Help section.

  • /gpfs/res_scratch: Each Altamira user has a directory under /gpfs/res_scratch. You must use this space to store the temporary files of your jobs during their execution. By default, files may reside in this filesystem for up to 7 days without modification; any older file might be removed. A quota per group will be enforced depending on the space assigned.

  • /gpfs/res_apps: This filesystem holds the applications and libraries that have already been installed on Altamira. Take a look at the directories or go to the Software section to see the applications available for general use. Before installing any application that is needed by your project, first check whether it is already installed on the system. If an application that you need is not on the system, you will have to ask our user support team to install it; see the Getting Help section for how to contact us.

If it is a general application with no restrictions on its use, it will be installed in a public directory under /gpfs/res_apps so that all users of Altamira can make use of it. If the application needs some type of license and its use must be restricted, a private directory under /gpfs/res_apps will be created so that only the required users of Altamira can make use of it.

All applications in /gpfs/res_apps are installed, controlled and supervised by the user support team. This does not mean that users cannot help in this task; both can work together to get the best result. The user support team can provide its wide experience in compiling and optimizing applications on the Altamira cluster, and the users can provide their knowledge of the application to be installed.

General applications that have been modified in some way from their normal behavior by the project users for their own study, and that may not be suitable for general use, must be installed under /gpfs/res_projects or /gpfs/res_home depending on the usage scope of the application, but not under /gpfs/res_apps.
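
Given the 7-day policy on /gpfs/res_scratch described above, it can be useful to see which files are old enough to be at risk of removal. The function below is a minimal sketch using the standard `find` command; the function name is hypothetical, and the exact layout of your personal directory under /gpfs/res_scratch is an assumption you should check for your own account.

```shell
# List regular files under a directory whose last modification is at least
# 7 days old. find's -mtime counts whole 24-hour periods, so "+6" means
# "7 or more".
list_stale_files() {
    find "$1" -type f -mtime +6
}
```

It could be called as `list_stale_files /gpfs/res_scratch/your_directory`, where `your_directory` stands for whatever directory you were assigned.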

Local Hard Drive

Every node has a local hard drive that can be used as local scratch space to store temporary files during the execution of one of your jobs. This space is mounted on the /scratch directory. The amount of space within the /scratch filesystem varies from node to node (depending on the total amount of disk space available). Data stored on the local hard drives of the compute nodes is not available from the login nodes. Local hard drive data is not removed automatically, so each job should remove its own data when it finishes. Jobs should use the $TMPDIR environment variable, which is set to the local scratch folder for each job.
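
Since local scratch data is not removed automatically, a job script can create its own working directory under $TMPDIR and delete it on exit. The following is a minimal sketch of that pattern; the `job.XXXXXX` directory prefix and the `partial.dat` file are illustrative assumptions, and the /tmp fallback exists only so the snippet can be tried outside a job (on Altamira, /tmp must not be used for user data).

```shell
# Create a private working directory under the job's local scratch space.
# The /tmp fallback is only for trying the sketch outside a job;
# inside a job $TMPDIR is already set.
scratch_root="${TMPDIR:-/tmp}"
workdir=$(mktemp -d "${scratch_root}/job.XXXXXX") || exit 1

# Remove the working directory and everything in it.
cleanup() {
    rm -rf "$workdir"
}
# Run cleanup whenever the script exits.
trap cleanup EXIT

# Temporary files go inside the working directory.
echo "intermediate data" > "$workdir/partial.dat"
```

Using a `trap` on `EXIT` means the cleanup also runs when the script stops early because of an error in one of its commands.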

<<Include (/running)>>

Software

Modules Environment

The Environment Modules package provides for the dynamic modification of a user's environment via modulefiles. Each modulefile contains the information needed to configure the shell for an application or a compilation. Modules can be loaded and unloaded dynamically and atomically, in a clean fashion. All popular shells are supported, including bash, ksh, zsh, sh, csh, tcsh, as well as some scripting languages such as perl.

The most important commands of the module tool are: list, avail, load, unload, switch and purge.

  • module list shows all the modules you have loaded:

[usertest@login1 ~]$ module list
Currently Loaded Modulefiles:
  1) gcc/4.6.3   2) GHC/7.4.2   3) openmpi-x86_64
  • module avail shows all the modules that the user is able to load:

[usertest@login1 ~]$ module avail
---------------------------- /usr/share/Modules/modulefiles --------------------------
dot         module-cvs  module-info modules     null        use.own

---------------------------------- /etc/modulefiles ----------------------------------
mvapich2-x86_64 openmpi-x86_64

------------------------------------- compilers --------------------------------------
GHC/7.4.1          GHC/7.4.2(default) gcc/4.6.3(default)

------------------------------------ applications ------------------------------------
CMAKE/2.8.7(default)      R/2.15.1(default)         SOAPdenovo/1.05(default)
MIRA/3.4.0.1(default)     SIESTA/3.1(default)       TRINITY_RNA_SEQ/r2012-06-8(default)
  • module load sets the environment variables required by the selected modulefile (PATH, MANPATH, LD_LIBRARY_PATH, etc.):

[usertest@login1 ~]$ module load CMAKE
load CMAKE/2.8.7 (PATH,MANPATH)
  • module unload removes all environment changes made by the module load command:

[usertest@login1 ~]$ module unload GHC
remove GHC/7.4.2 (PATH,LD_LIBRARY_PATH,MANPATH)
  • module switch acts as a module unload and a module load command at the same time:

[usertest@login1 ~]$ module load GHC
load GHC/7.4.2 (PATH,LD_LIBRARY_PATH,MANPATH)
[usertest@login1 ~]$ module switch GHC GHC/7.0.1
switch1 GHC/7.4.2 (PATH,LD_LIBRARY_PATH,MANPATH)
switch2 GHC/7.0.1 (PATH,LD_LIBRARY_PATH,MANPATH)
switch3 GHC/7.4.2 (PATH,LD_LIBRARY_PATH,MANPATH)
ModuleCmd_Switch.c(278):VERB:4: done
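
The commands above can also be combined in a script so that a build or a job always starts from the same environment. This is a sketch under assumptions: the function name is hypothetical, the module names are taken from the example listing above and may differ on the current system, and it presumes the `module` command is available in the shell running the script.

```shell
# Reset the environment and load a fixed set of modules.
# Returns 1 if the module command is not available in this shell.
setup_build_env() {
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not found" >&2
        return 1
    fi
    module purge
    module load gcc/4.6.3
    module load openmpi-x86_64
}
```

Calling `setup_build_env` followed by `module list` should then show only the modules named in the function.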

Acknowledgment in publications

Getting Help

IFCA provides consulting assistance to users. User support consultants are available during normal business hours, Monday to Friday, 09:00 to 18:00 (CEST).

User questions and support are handled at <res_support@ifca.unican.es>. If you need assistance, please supply us with the nature of the problem, the date and time that the problem occurred, and the location of any other relevant information, such as output

Appendices

A. SSH

SSH is a program that enables secure logins over an insecure network. It encrypts all the data passing both ways, so that if it is intercepted it cannot be read. It also replaces old and insecure tools such as telnet, rlogin, rcp and ftp. SSH is client-server software: both machines must have ssh installed for it to work.

We have already installed an ssh server on our machines. You must have an ssh client installed on your local machine. SSH is available without charge for almost all versions of Unix. We recommend the OpenSSH client, which can be downloaded from http://www.openssh.org, but any client compatible with SSH version 2 can be used.

On Windows systems we recommend the use of putty. It is a free SSH client that you can download from http://www.putty.nl/. Any other client compatible with SSH version 2 can also be used.

To transfer files to or from Altamira you need a secure ftp (sftp) or secure copy (scp) client. There are several different clients but, as previously mentioned, we recommend the putty clients for transferring files: psftp and pscp. You can find them on the same web page as putty ( http://www.putty.nl/ ).

Some other possible tools for users requiring graphical file transfers could be:

To use psftp, you need to pass it the machine name (altamira1.ifca.es), and then the username and password. Once you are connected, it works like a Unix command line. The help command gives a list of all available commands; the most useful are:

get file_name
To transfer a file from Altamira to your local machine.
put file_name
To transfer a file from your local machine to Altamira.
cd directory
To change remote working directory.
dir
To list contents of a remote directory.
lcd directory
To change local working directory.
!dir
To list contents of a local directory.

eciencia: Supercomputing/Userguide (last edited 2018-10-18 10:09:02 by aidaph)