Troubleshooting Guide
Check the mpi-start page for more information.
Configuration
yaim plugin does not publish MPI-* tags into the RunTimeEnvironment
Check that you are using the MPI_CE node type and that it is the first node type in the command line.
Check the yaim log for messages like Added <FLAVOUR> to set to CE_RUNTIMEENV; if they do not appear, you have probably not enabled any flavour in your yaim profile.
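As a sketch, a yaim configuration and invocation that enables the Open MPI flavour and lists MPI_CE first could look like this (the node types after MPI_CE are examples and depend on your site):

# site-info.def: enable at least one MPI flavour
MPI_OPENMPI_ENABLE="yes"

# run yaim with MPI_CE as the first node type
/opt/glite/yaim/bin/yaim -c -s site-info.def -n MPI_CE -n creamCE -n TORQUE_utils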
Compilation
Compiler not found
While it is advised that sites supporting MPI install the MPI compilers, they may not be available at all sites. Some sites have the -devel packages installed but lack the underlying compiler (gcc). In the case of Open MPI, a message like this is shown:
--------------------------------------------------------------------------
The Open MPI wrapper compiler was unable to find the specified compiler
gcc in your PATH.

Note that this compiler was either specified at configure time or in
one of several possible environment variables.
--------------------------------------------------------------------------
You should install the C/C++/Fortran compilers to fully support the compilation of MPI applications.
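For example, on a RHEL-compatible worker node (package names may vary with your distribution):

yum install gcc gcc-c++ gcc-gfortran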
Incompatible Libraries
The available compiler does not match the installed libraries, or the compiler paths are set incorrectly. mpi-start should fix the compiler flags (32/64 bits) and set them in the MPI_<COMPILER>_FLAGS variables, where COMPILER is one of MPICC (C), MPICXX (C++), MPIF90 (Fortran 90) or MPIF77 (Fortran 77). Use those variables for your compilation.
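A minimal sketch of an mpi-start pre-run hook that compiles with these variables (the application name and the assumption that the source file is <application>.c are examples):

#!/bin/sh
# hooks.sh: compile the application before mpi-start runs it,
# using the flags mpi-start has set for the C compiler.
pre_run_hook () {
  echo "Compiling ${I2G_MPI_APPLICATION}"
  mpicc ${MPI_MPICC_FLAGS} -o ${I2G_MPI_APPLICATION} ${I2G_MPI_APPLICATION}.c
  if [ $? -ne 0 ]; then
    echo "Error compiling program. Exiting..."
    return 1
  fi
  return 0
}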
Execution
Mpiexec errors
Some sites have reported errors related to bad usage of OSC Mpiexec. Sample error messages:
error while loading shared libraries: libtorque.so.0: cannot open shared object file: No such file or directory
mpiexec: Error: PBS_JOBID not set in environment.
These errors are due to using a version of Mpiexec built against a different Torque version than the one installed, or to using this starter at a non-PBS site.
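A quick way to check the Torque linkage of the installed Mpiexec (output shown for the broken case):

$ ldd $(which mpiexec) | grep libtorque
libtorque.so.0 => not found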
Bad allocation of slots
mpi-start will use all the slots allocated by the batch scheduler (unless told otherwise), but a wrongly configured CE or batch system may prevent the proper execution of jobs.
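To verify the allocation, inspect the machinefile on the first worker node of a parallel job; it should contain one line per allocated slot. For example, for 4 slots on two nodes with two slots each (hostnames are examples):

$ cat $PBS_NODEFILE
wn01.example.org
wn01.example.org
wn02.example.org
wn02.example.org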
Cannot find scheduler
If mpi-start is not able to detect the batch system being used, it will issue a cannot find scheduler error message and exit with code 3. This is normally due to a misconfiguration of the batch system or the Computing Element. Check the next sections for more information.
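mpi-start's scheduler plugins typically detect the batch system through scheduler-specific environment variables, so checking them on a worker node is a quick first diagnostic (a sketch, assuming the usual PBS/SGE/LSF variables):

echo "PBS_NODEFILE=${PBS_NODEFILE:-unset}"   # PBS/Torque
echo "PE_HOSTFILE=${PE_HOSTFILE:-unset}"     # SGE
echo "LSB_HOSTS=${LSB_HOSTS:-unset}"         # LSF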
PBS/Torque
CREAM BLAH produces a job for PBS/Torque with the -l nodes=N option (with N equal to the number of required slots). This option fails for any N larger than the number of worker nodes at the site. To avoid this, you should configure a submit filter in Torque that transforms that expression into one adapted to your hardware configuration. The yaim plugin can create such a filter automatically if MPI_SUBMIT_FILTER is set to "yes" in your configuration file.
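A minimal sketch of such a filter (a Torque submit filter reads the job script on stdin and writes the transformed script on stdout; the slots-per-node value and install path are examples, and the yaim-generated filter handles more cases):

#!/bin/bash
# Map a flat "#PBS -l nodes=N" request onto worker nodes with PPN
# slots each. Enable it in torque.cfg with:
#   SUBMITFILTER /usr/local/sbin/torque_submit_filter
PPN=8
re='^#PBS -l nodes=([0-9]+)$'
while IFS= read -r line; do
  if [[ "$line" =~ $re ]]; then
    n=${BASH_REMATCH[1]}
    if (( n <= PPN )); then
      echo "#PBS -l nodes=1:ppn=$n"
    else
      echo "#PBS -l nodes=$(( (n + PPN - 1) / PPN )):ppn=$PPN"
    fi
  else
    echo "$line"
  fi
done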
SGE
Support for MPI jobs in SGE requires configuring a Parallel Environment and enabling it for jobs submitted from the Computing Element. Current CREAM SGE support selects any available parallel environment (it uses the -pe * option). If your job fails to start with a cannot find scheduler error from mpi-start, the parallel environment is probably not properly configured.
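A sketch of a suitable parallel environment definition (the PE and queue names are examples; adjust slots and allocation_rule to your site):

$ qconf -ap openmpi
pe_name           openmpi
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE

# then add "pe_list openmpi" to the queue used by the CE:
$ qconf -mq grid.q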
File transfer
mpi-start tries to copy files to the remote hosts if it does not find a shared filesystem. The shared filesystem detection is limited to the working directory; if your site uses a shared area which is not where the job starts, mpi-start can be configured to use that area with the MPI_SHARED_HOME and MPI_SHARED_HOME_PATH variables. Check the manual for details.
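For example, in the yaim configuration (the path is an example):

MPI_SHARED_HOME="yes"
MPI_SHARED_HOME_PATH="/shared/mpi"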
The fail-over method for copying the files is ssh (scp). This requires passwordless ssh to be configured between the nodes. If mpi-start is not able to log into the remote host, it will display a failed to create directory on remote machine error message. If you get that message, check that passwordless ssh works between your nodes.
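A quick check from one worker node to another (hostnames are examples; BatchMode makes ssh fail instead of prompting for a password):

$ ssh -o BatchMode=yes wn02.example.org hostname
wn02.example.org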