location: Diff for "Middleware/MpiStart/TroubleshootingGuide"
Differences between revisions 5 and 6
Revision 5 as of 2011-09-19 20:33:22
Size: 4080
Editor: enol
Comment:
Revision 6 as of 2011-09-19 21:12:51
Size: 2878
Editor: enol
Comment:
Line 3: Line 3:
<<TableOfContents(2)>> <<TableOfContents(3)>>
Line 5: Line 5:
Check the [[..|mpi-start]] page for more information on mpi-start Check the [[../|mpi-start]] page for more information.
Line 8: Line 8:
== Configuration ==
Line 9: Line 10:
== ./mpi.sh: line 34: /opt/i2g/bin/mpi-start: No such file or directory == === yaim plugin does not publish MPI-* tags into the RunTimeEnvironment ===
Line 11: Line 12:
=== Cause ===
Mpi-start is not installed
 1. Check that you are using the `MPI_CE` node type and that it is the '''first''' node type in the command line.
 1. Check the yaim log for messages like `Added <FLAVOUR> to set to CE_RUNTIMEENV`. If they do not appear, you have probably not enabled any flavour in your yaim profile.
Line 14: Line 15:
=== Solution ===
Install MPI-START
== Compilation ==
Line 17: Line 17:
== mpiexec: error while loading shared libraries: libtorque.so.0: cannot open shared object file: No such file or directory == === Compiler not found ===
Line 19: Line 19:
=== Cause ===
The installed mpiexec version is wrong for installed version of torque.
While sites supporting MPI are advised to install the MPI compilers, they may not be available at all sites. Some sites have installed the -devel packages but not the underlying compiler (`gcc`). In the case of Open MPI, a message like this is shown:
Line 22: Line 21:
=== Solution ===
Install the correct mpiexec version
{{{
--------------------------------------------------------------------------
The Open MPI wrapper compiler was unable to find the specified compiler
gcc in your PATH.
Line 25: Line 26:
== mpiexec: cannot connect to local mpd == Note that this compiler was either specified at configure time or in
one of several possible environment variables.
--------------------------------------------------------------------------
}}}
Line 27: Line 31:
=== Cause ===
Unknown

=== Solution ===
Unknown

== skipping incompatible /XXXX/libmpich.a when searching for -lmpich ==

=== Cause ===
The available compiler does not match the installed libraries, or the compiler paths are set incorrectly.

=== Solution ===
Install newer version of mpi-start that should fix the compiler flags (32/64 bits) or set correct compiler flags.

== mpiexec: Error: PBS_JOBID not set in environment. ==

=== Cause ===
pbs/torque mpiexec installed in a non PBS site

=== Solution ===
Install the correct mpiexec or remove the current one

== I2G_MPI_START variable is not set! ==

=== Cause ===
The environment variable I2G_MPI_START is not set although announced in the site BDII

=== Solution ===
Set the environment variable to the correct path

== mpicc: command not found ==

=== Cause ===
MPI compiler not available

=== Solution ===
Install the mpi compiler

== /opt/i2g/bin/../etc/mpi-start/openmpi.mpi: line 70: MPI_SPECIFIC_PARAMS+=-x X509_USER_PROXY --prefix /opt/i2g/openmpi : No such file or directory ==

=== Cause ===
An old version of mpi-start is installed (pre 0.0.54)

=== Solution ===
Install a newer version.

== which: no mpiexec in … ==

=== Cause ===
Environment not set correctly

=== Solution ===
Set the MPI_<flavour>_PATH correctly.

== mpiexec: Warning: task 0 exited before completing MPI startup. mpiexec: Warning: task 1 was never spawned due to earlier errors. ==

=== Cause ===
Unknown

=== Solution ===
Unknown

== Timeout when executing test MPI-sft-mpich after 600 seconds! ==
=== Cause ===
Unknown

=== Solution ===
Unknown

== /opt/mpich-1.2.7p1//bin/mpicc: line 326: cc: command not found ==
=== Cause ===
Compiler not installed

=== Solution ===
Install the development-related RPMs (e.g. openmpi-devel, mpich2-devel)

== cannot find scheduler ==
=== Cause ===
Jobmanager for SGE not configured properly to submit jobs to a PE.

=== Solution ===
Configure the jobmanager

== ... mpi.h: No such file or directory ==

=== Cause ===
Most likely, devel packages not installed.

=== Solution ===
Install devel packages
You should install the C/C++/Fortran compilers to fully support the compilation of MPI applications.
Line 119: Line 34:
== CpuNumber: attribute cannot be specified with non MPICH jobs ==
=== Cause ===
Unpatched SL4 version of CREAM-CE installed.
=== Incompatible Libraries ===
Line 123: Line 36:
=== Solution ===
Bug #56762: since gLite 3.1 Update 56 (patch #3259) it is not possible to specify NodeNumber or CpuNumber in the JDL when JobType is Normal.
The available compiler does not match the installed libraries, or the compiler paths are set incorrectly. mpi-start should fix the compiler flags (32/64 bits) and set them in the `MPI_<COMPILER>_FLAGS` variables, where COMPILER is one of `MPICC` (C), `MPICXX` (C++), `MPIF90` (Fortran 90) or `MPIF77` (Fortran 77). Use the matching variable in your compilation.
Line 126: Line 38:
Waiting for the patch fixing this problem, please apply the following workaround:
 * Replace $CATALINA_HOME/webapps/ce-cream/WEB-INF/lib/glite-jdl-api-java.jar with this file: [[http://grid.pd.infn.it/cream/patch-glite-jdl-api-java/glite-jdl-api-java.jar|glite-jdl-api-java.jar]]
 * Restart tomcat
== Execution ==
Line 130: Line 40:
This problem doesn't affect the CREAM CE version for gLite 3.2/sl5 === mpiexec errors ===
Line 132: Line 42:
== mpiexec was unable to launch the specified application as it could not find an executable: ==
=== Cause ===
Open MPI mpiexec used instead of mpich's
Some sites have reported errors caused by incorrect use of [[http://www.osc.edu/~djohnson/mpiexec/index.php|OSC Mpiexec]]. Sample error messages:
{{{
error while loading shared libraries: libtorque.so.0: cannot open shared object file: No such file or directory
}}}
{{{
mpiexec: Error: PBS_JOBID not set in environment.
}}}
Line 136: Line 50:
=== Solution ===
Set the environment variables to the appropriate mpich location
These errors occur when the installed Mpiexec version does not match the installed torque, or when this starter is used at a non-PBS site.
Line 139: Line 52:
=== `cannot find scheduler` ===
Line 140: Line 54:
== /opt/mpiexec-0.82/bin/mpiexec: No such file or directory ==
=== Cause ===
The MPI_XXXX_MPIEXEC variable is defined to a mpiexec version, but the binary is not installed or cannot be found.
If mpi-start is not able to detect the batch system being used, it will issue a `cannot find scheduler` error message and exit with code 3. This is normally due to misconfiguration of the batch system or the Computing Element.
Line 144: Line 56:
=== Solution ===
Set the variable to the correct location of mpiexec, install the package, or check that the filesystem where the binary is located is correctly mounted at the Worker Nodes.
==== SGE ====

Support for MPI jobs in SGE requires configuring a Parallel Environment and enabling it for job submission from the Computing Element. Current CREAM SGE support selects any available parallel environment (it uses the `-pe *` option). If your job fails to start with a `cannot find scheduler` error from mpi-start, the parallel environment is probably not configured properly.

Troubleshooting Guide

Check the mpi-start page for more information.

Configuration

yaim plugin does not publish MPI-* tags into the RunTimeEnvironment

  1. Check that you are using the MPI_CE node type and that it is the first node type in the command line.

  2. Check the yaim log for messages like Added <FLAVOUR> to set to CE_RUNTIMEENV. If they do not appear, you have probably not enabled any flavour in your yaim profile.
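
The log check in step 2 can be scripted. The following is a minimal sketch, assuming the log line has exactly the shape quoted above; the real yaim message may differ in detail, and the sample log text and the function name `enabled_flavours` are made up for illustration.

```python
import re

# Assumed message shape, taken from the example wording in the guide.
# Adjust the pattern if the real yaim log line differs.
FLAVOUR_LINE = re.compile(r"Added (\S+) to set to CE_RUNTIMEENV")


def enabled_flavours(log_text):
    """Return the flavours named in matching log lines, in order, without duplicates."""
    seen = []
    for match in FLAVOUR_LINE.finditer(log_text):
        flavour = match.group(1)
        if flavour not in seen:
            seen.append(flavour)
    return seen


# Hypothetical log excerpt, for illustration only.
sample = """INFO: Added OPENMPI to set to CE_RUNTIMEENV
INFO: some unrelated line
INFO: Added MPICH2 to set to CE_RUNTIMEENV
"""
print(enabled_flavours(sample))
```

An empty result suggests that no flavour was enabled in the yaim profile.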

Compilation

Compiler not found

While sites supporting MPI are advised to install the MPI compilers, they may not be available at all sites. Some sites have installed the -devel packages but not the underlying compiler (gcc). In the case of Open MPI, a message like this is shown:

--------------------------------------------------------------------------
The Open MPI wrapper compiler was unable to find the specified compiler
gcc in your PATH.

Note that this compiler was either specified at configure time or in
one of several possible environment variables.
--------------------------------------------------------------------------

You should install the C/C++/Fortran compilers to fully support the compilation of MPI applications.
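
A quick way to see which compilers are missing on a node is to look them up in PATH. This is a minimal sketch: gcc comes from the error message above, while the other tool names in the usage line are assumed examples and should be replaced with the compilers your site actually provides.

```python
import shutil


def missing_tools(tools, path=None):
    """Return the entries of `tools` that cannot be found on PATH (or on `path` if given)."""
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


# gcc is named in the Open MPI message; the remaining names are assumptions.
print(missing_tools(["gcc", "g++", "gfortran", "mpicc"]))
```

Any name printed here is a candidate for installation before MPI applications can be compiled on that node.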

Incompatible Libraries

The available compiler does not match the installed libraries, or the compiler paths are set incorrectly. mpi-start should fix the compiler flags (32/64 bits) and set them in the MPI_<COMPILER>_FLAGS variables, where COMPILER is one of MPICC (C), MPICXX (C++), MPIF90 (Fortran 90) or MPIF77 (Fortran 77). Use the matching variable in your compilation.
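
As an illustration of using those variables, the sketch below builds a compile command that picks up the flags from the environment. It assumes the variable name follows the MPI_<COMPILER>_FLAGS pattern described above; the file names and the `-m64` value are placeholders, and `compile_command` is a hypothetical helper, not part of mpi-start.

```python
import os
import shlex


def compile_command(compiler, sources, output, env=None):
    """Build an argument list that includes the flags set for `compiler`, if any."""
    env = os.environ if env is None else env
    # Variable name assumed from the MPI_<COMPILER>_FLAGS pattern in the guide.
    flags = shlex.split(env.get("MPI_%s_FLAGS" % compiler.upper(), ""))
    return [compiler] + flags + ["-o", output] + list(sources)


# Placeholder environment; in a real job the variable would already be set.
print(compile_command("mpicc", ["hello.c"], "hello", env={"MPI_MPICC_FLAGS": "-m64"}))
```

The returned list can be passed to `subprocess.run` from a pre-run hook or a build script.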

Execution

mpiexec errors

Some sites have reported errors caused by incorrect use of OSC Mpiexec. Sample error messages:

error while loading shared libraries: libtorque.so.0: cannot open shared object file: No such file or directory

mpiexec: Error: PBS_JOBID not set in environment.

These errors occur when the installed Mpiexec version does not match the installed torque, or when this starter is used at a non-PBS site.
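
When scanning job output for these messages, a small classifier can map each one to the cause given above. This is a minimal sketch; the function name and the returned wording are illustrative, and only the two messages quoted in this section are covered.

```python
def diagnose_mpiexec(stderr_text):
    """Map the two sample Mpiexec messages to their likely cause, or return None."""
    if "libtorque.so" in stderr_text:
        return "Mpiexec build does not match the installed torque"
    if "PBS_JOBID not set" in stderr_text:
        return "OSC Mpiexec used outside a PBS/torque job"
    return None


print(diagnose_mpiexec("mpiexec: Error: PBS_JOBID not set in environment."))
```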

`cannot find scheduler`

If mpi-start is not able to detect the batch system being used, it will issue a cannot find scheduler error message and exit with code 3. This is normally due to misconfiguration of the batch system or the Computing Element.
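
A wrapper script can detect this condition from the exit code and the message together. The sketch below is an assumption-laden illustration: it uses a stand-in command instead of mpi-start so that it runs anywhere, and `scheduler_not_found` is a hypothetical helper name.

```python
import subprocess
import sys


def scheduler_not_found(command):
    """Run `command`; return True if it exits with code 3 and reports the scheduler error."""
    result = subprocess.run(command, capture_output=True, text=True)
    output = result.stdout + result.stderr
    return result.returncode == 3 and "cannot find scheduler" in output


# Stand-in for mpi-start that imitates the behaviour described above.
fake_mpi_start = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('cannot find scheduler\\n'); sys.exit(3)",
]
print(scheduler_not_found(fake_mpi_start))
```

In real use, `command` would be the mpi-start invocation of the job.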

SGE

Support for MPI jobs in SGE requires configuring a Parallel Environment and enabling it for job submission from the Computing Element. Current CREAM SGE support selects any available parallel environment (it uses the -pe * option). If your job fails to start with a cannot find scheduler error from mpi-start, the parallel environment is probably not configured properly.
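
A first check is whether any parallel environment is defined at all. The sketch below assumes that `qconf -spl` is available on the node and prints one parallel environment name per line; the sample names are made up, and both helper functions are hypothetical.

```python
import subprocess


def parse_pe_names(qconf_output):
    """Split `qconf -spl` style output (one name per line) into a list of names."""
    return [line.strip() for line in qconf_output.splitlines() if line.strip()]


def list_parallel_environments():
    """Ask SGE for its parallel environments; requires qconf on PATH."""
    result = subprocess.run(["qconf", "-spl"], capture_output=True, text=True)
    return parse_pe_names(result.stdout)


# Hypothetical output; these PE names are for illustration only.
sample = "mpi\nopenmpi\n"
print(parse_pe_names(sample))
```

An empty list from `list_parallel_environments()` would point to a missing parallel environment. A non-empty list still leaves the queue configuration and the Computing Element submission settings to be checked.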

eciencia: Middleware/MpiStart/TroubleshootingGuide (last edited 2011-09-20 07:40:38 by enol)