location: Diff for "Middleware/MpiStart/TroubleshootingGuide"
Differences between revisions 1 and 8 (spanning 7 versions)
Revision 1 as of 2010-11-15 16:33:08
Size: 4076
Editor: enol
Comment:
Revision 8 as of 2011-09-20 07:18:10
Size: 3595
Editor: enol
Comment:
Line 3: Line 3:
Check the [[../UserDocumentation|User Documentation]] page for more information on MPI-Start.
<<TableOfContents(3)>>
Line 5: Line 5:
== ./mpi.sh: line 34: /opt/i2g/bin/mpi-start: No such file or directory ==

=== Cause ===
Mpi-start is not installed

=== Solution ===
Install MPI-START

== mpiexec: error while loading shared libraries: libtorque.so.0: cannot open shared object file: No such file or directory ==

=== Cause ===
The installed mpiexec version does not match the installed version of torque.

=== Solution ===
Install the correct mpiexec version

== mpiexec: cannot connect to local mpd ==

=== Cause ===
Unknown

=== Solution ===
Unknown

== skipping incompatible /XXXX/libmpich.a when searching for -lmpich ==

=== Cause ===
The available compiler does not match the installed libraries, or the compiler paths are set incorrectly.

=== Solution ===
Install a newer version of mpi-start, which should fix the compiler flags (32/64 bits), or set the correct compiler flags manually.

== mpiexec: Error: PBS_JOBID not set in environment. ==

=== Cause ===
The PBS/Torque mpiexec is installed at a non-PBS site.

=== Solution ===
Install the correct mpiexec or remove the current one

== I2G_MPI_START variable is not set! ==

=== Cause ===
The environment variable I2G_MPI_START is not set although announced in the site BDII

=== Solution ===
Set the environment variable to the correct path

== mpicc: command not found ==

=== Cause ===
MPI compiler not available

=== Solution ===
Install the mpi compiler

== /opt/i2g/bin/../etc/mpi-start/openmpi.mpi: line 70: MPI_SPECIFIC_PARAMS+=-x X509_USER_PROXY --prefix /opt/i2g/openmpi : No such file or directory ==

=== Cause ===
An old version of mpi-start is installed (pre 0.0.54)

=== Solution ===
Install a newer version.

== which: no mpiexec in … ==

=== Cause ===
Environment not set correctly

=== Solution ===
Set the MPI_<flavour>_PATH correctly.

== mpiexec: Warning: task 0 exited before completing MPI startup. mpiexec: Warning: task 1 was never spawned due to earlier errors. ==

=== Cause ===
Unknown

=== Solution ===
Unknown

== Timeout when executing test MPI-sft-mpich after 600 seconds! ==
=== Cause ===
Unknown

=== Solution ===
Unknown

== /opt/mpich-1.2.7p1//bin/mpicc: line 326: cc: command not found ==
=== Cause ===
Compiler not installed

=== Solution ===
Install the development-related RPMs (e.g. openmpi-devel, mpich2-devel)

== cannot find scheduler ==
=== Cause ===
Jobmanager for SGE not configured properly to submit jobs to a PE.

=== Solution ===
Configure the jobmanager

== ... mpi.h: No such file or directory ==

=== Cause ===
Most likely, devel packages not installed.

=== Solution ===
Install devel packages
Check the [[../|mpi-start]] page for more information.
Line 115: Line 8:
== CpuNumber: attribute cannot be specified with non MPICH jobs ==
=== Cause ===
Unpatched SL4 version of CREAM-CE installed.
== Configuration ==
Line 119: Line 10:
=== Solution ===
Bug #56762: since gLite 3.1 Update 56 (patch #3259) it is not possible to specify NodeNumber or CpuNumber in the JDL when JobType is Normal.
=== yaim plugin does not publish MPI-* tags into the RunTimeEnvironment ===
Line 122: Line 12:
Until the patch fixing this problem is available, apply the following workaround:
 * Replace `$CATALINA_HOME/webapps/ce-cream/WEB-INF/lib/glite-jdl-api-java.jar` with this file: [[http://grid.pd.infn.it/cream/patch-glite-jdl-api-java/glite-jdl-api-java.jar|glite-jdl-api-java.jar]]
 * Restart tomcat
 1. Check that you are using the `MPI_CE` node type and that it is the '''first''' node type in the command line.
 1. Check the yaim log for messages like `Added <FLAVOUR> to set to CE_RUNTIMEENV`. If they do not appear, you have probably not enabled any flavour in your yaim profile.
Line 127: Line 15:
This problem doesn't affect the CREAM CE version for gLite 3.2/sl5
== Compilation ==
Line 129: Line 17:
== mpiexec was unable to launch the specified application as it could not find an executable: ==
=== Cause ===
Open MPI mpiexec used instead of mpich's
=== Compiler not found ===
Line 133: Line 19:
=== Solution ===
Set the environment variables to the appropriate mpich location
While sites supporting MPI are advised to install the MPI compiler, it may not be available at all sites. Some sites have installed the -devel packages but not the underlying compiler (`gcc`). In the case of Open MPI, a message like this is shown:

{{{
--------------------------------------------------------------------------
The Open MPI wrapper compiler was unable to find the specified compiler
gcc in your PATH.

Note that this compiler was either specified at configure time or in
one of several possible environment variables.
--------------------------------------------------------------------------
}}}

You should install the C/C++/Fortran compilers to fully support the compilation of MPI applications.
Line 137: Line 34:
== /opt/mpiexec-0.82/bin/mpiexec: No such file or directory ==
=== Cause ===
The MPI_XXXX_MPIEXEC variable points to an mpiexec version, but the binary is not installed or cannot be found.
=== Incompatible Libraries ===
Line 141: Line 36:
=== Solution ===
Set the variable to the correct location of mpiexec, install the package, or check that the filesystem where the binary is located is correctly mounted at the Worker Nodes.
The available compiler does not match the installed libraries, or the compiler paths are set incorrectly. mpi-start should fix the compiler flags (32/64 bits) and set them in the `MPI_<COMPILER>_FLAGS` variables, where COMPILER is one of `MPICC` (C), `MPICXX` (C++), `MPIF90` (Fortran 90) or `MPIF77` (Fortran 77). Use one of those variables for your compilation.

== Execution ==

=== mpiexec errors ===

Some sites have reported errors related to bad usage of [[http://www.osc.edu/~djohnson/mpiexec/index.php|OSC Mpiexec]]. Sample error messages:
{{{
error while loading shared libraries: libtorque.so.0: cannot open shared object file: No such file or directory
}}}
{{{
mpiexec: Error: PBS_JOBID not set in environment.
}}}

These errors are caused by using a version of Mpiexec that does not match the installed torque, or by trying to use this starter at a non-PBS site.

=== cannot find scheduler ===

If mpi-start is not able to detect the batch system being used, it will issue a `cannot find scheduler` error message and exit with code 3. This is normally due to misconfiguration of the batch system or the Computing Element.

==== SGE ====

Support for MPI jobs in SGE requires configuring a Parallel Environment and enabling it for the submission of jobs from the Computing Element. The current CREAM SGE support selects any available parallel environment (it uses the `-pe *` option). If your job fails to start with a `cannot find scheduler` error from mpi-start, the parallel environment is probably not properly configured.

=== File transfer ===

mpi-start tries to copy files to remote hosts if it does not find a shared filesystem. The shared filesystem detection is limited to the working directory. If your site uses a shared space that is not where the job starts, mpi-start can be configured to use that area with the `MPI_SHARED_HOME` and `MPI_SHARED_HOME_PATH` variables. Check the manual for details.

The fail-over method for copying the files is ssh (scp). This requires passwordless ssh to be configured between the nodes. If mpi-start is not able to log in to the remote host, it will display a `failed to create directory on remote machine` error message. Check that passwordless ssh is working between your nodes if you get that message.

Troubleshooting Guide

Check the mpi-start page for more information.

Configuration

yaim plugin does not publish MPI-* tags into the RunTimeEnvironment

  1. Check that you are using the MPI_CE node type and that it is the first node type in the command line.

  2. Check the yaim log for messages like Added <FLAVOUR> to set to CE_RUNTIMEENV. If they do not appear, you have probably not enabled any flavour in your yaim profile.
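
The log check in step 2 can be scripted. The following is a minimal sketch in Python, assuming the yaim log is available as plain text. The function name `published_flavours` and the sample log lines are illustrative only and are not part of yaim.

```python
import re

# Pattern for the message quoted above; <FLAVOUR> is captured as one word.
ADDED_RE = re.compile(r"Added (\S+) to set to CE_RUNTIMEENV")


def published_flavours(log_text):
    """Return the flavours the log reports as added, in order of appearance."""
    return ADDED_RE.findall(log_text)


# Hypothetical log excerpt, made up for illustration.
sample = (
    "INFO: Added OPENMPI to set to CE_RUNTIMEENV\n"
    "INFO: Added MPICH2 to set to CE_RUNTIMEENV\n"
)
print(published_flavours(sample))
```

An empty list suggests that no flavour was enabled in the yaim profile.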

Compilation

Compiler not found

While sites supporting MPI are advised to install the MPI compiler, it may not be available at all sites. Some sites have installed the -devel packages but not the underlying compiler (gcc). In the case of Open MPI, a message like this is shown:

--------------------------------------------------------------------------
The Open MPI wrapper compiler was unable to find the specified compiler
gcc in your PATH.

Note that this compiler was either specified at configure time or in
one of several possible environment variables.
--------------------------------------------------------------------------

You should install the C/C++/Fortran compilers to fully support the compilation of MPI applications.
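
A quick way to see which compilers are missing on a node is to look them up on the search path. This is a minimal sketch, assuming the GNU compilers and the `mpicc` wrapper are the tools of interest; the tool list and the function name `missing_tools` are assumptions, not something mpi-start provides.

```python
import shutil


def missing_tools(tools, path=None):
    """Return the tools from `tools` that cannot be found on the search path.

    `path` is a PATH-style string; None means the PATH of the current process.
    """
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


# Tool names are assumptions: the GNU compilers plus the MPI C wrapper.
missing = missing_tools(["gcc", "g++", "gfortran", "mpicc"])
print("missing:", missing if missing else "none")
```

Running it on a Worker Node lists the tools that still need to be installed there.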

Incompatible Libraries

The available compiler does not match the installed libraries, or the compiler paths are set incorrectly. mpi-start should fix the compiler flags (32/64 bits) and set them in the MPI_<COMPILER>_FLAGS variables, where COMPILER is one of MPICC (C), MPICXX (C++), MPIF90 (Fortran 90) or MPIF77 (Fortran 77). Use one of those variables for your compilation.
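
As an illustration of using those variables, the sketch below builds a compile command from the flags variable for C. It assumes the variable is named `MPI_MPICC_FLAGS`, following the `MPI_<COMPILER>_FLAGS` pattern described above; the helper `compile_command`, the file names and the `-m64` value are made up for the example.

```python
import shlex


def compile_command(source, output, env, wrapper="mpicc", flags_var="MPI_MPICC_FLAGS"):
    """Build an argument list that passes the flags set by mpi-start to the wrapper."""
    flags = shlex.split(env.get(flags_var, ""))
    return [wrapper, *flags, "-o", output, source]


# Example with a made-up flag value; in a real job, pass os.environ instead.
print(compile_command("hello.c", "hello", {"MPI_MPICC_FLAGS": "-m64"}))
```

The resulting list can be handed to `subprocess.run` inside the job. If the variable is unset, the command falls back to the wrapper's defaults.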

Execution

mpiexec errors

Some sites have reported errors related to bad usage of OSC Mpiexec. Sample error messages:

error while loading shared libraries: libtorque.so.0: cannot open shared object file: No such file or directory

mpiexec: Error: PBS_JOBID not set in environment.

These errors are caused by using a version of Mpiexec that does not match the installed torque, or by trying to use this starter at a non-PBS site.
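
The two messages can be recognised automatically in a job's error output. The following is a minimal sketch that maps each quoted message to the cause given above; the `KNOWN_ERRORS` table and the `diagnose` helper are illustrative names, not part of mpi-start or OSC Mpiexec.

```python
import os

# Substrings of the messages quoted above, mapped to the causes given in the text.
KNOWN_ERRORS = {
    "libtorque.so.0: cannot open shared object file":
        "Mpiexec version does not match the installed torque",
    "PBS_JOBID not set in environment":
        "OSC Mpiexec is being used outside a PBS job",
}


def diagnose(stderr_text):
    """Return the likely causes for the known messages found in `stderr_text`."""
    return [cause for needle, cause in KNOWN_ERRORS.items() if needle in stderr_text]


print(diagnose("mpiexec: Error: PBS_JOBID not set in environment."))
# Also report whether the current process runs inside a PBS job.
print("PBS_JOBID set:", "PBS_JOBID" in os.environ)
```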

cannot find scheduler

If mpi-start is not able to detect the batch system being used, it will issue a cannot find scheduler error message and exit with code 3. This is normally due to misconfiguration of the batch system or the Computing Element.
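
A wrapper around the job can flag this exit code explicitly. This is a minimal sketch under the assumption that exit code 3 is worth a hint on its own; other failures may return the same code, so the hint is only a suggestion. The helper `run_and_report` and the stand-in command are hypothetical.

```python
import subprocess
import sys

SCHEDULER_NOT_FOUND = 3  # exit code named in the text above


def run_and_report(command):
    """Run `command` and return (exit_code, hint); hint is None unless the code is 3."""
    result = subprocess.run(command)
    if result.returncode == SCHEDULER_NOT_FOUND:
        hint = "exit code 3: possibly 'cannot find scheduler'; check the batch system configuration"
        return result.returncode, hint
    return result.returncode, None


# Stand-in command that just exits with code 3; replace it with the real job wrapper.
print(run_and_report([sys.executable, "-c", "import sys; sys.exit(3)"]))
```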

SGE

Support for MPI jobs in SGE requires configuring a Parallel Environment and enabling it for the submission of jobs from the Computing Element. The current CREAM SGE support selects any available parallel environment (it uses the -pe * option). If your job fails to start with a cannot find scheduler error from mpi-start, the parallel environment is probably not properly configured.

File transfer

mpi-start tries to copy files to remote hosts if it does not find a shared filesystem. The shared filesystem detection is limited to the working directory. If your site uses a shared space that is not where the job starts, mpi-start can be configured to use that area with the MPI_SHARED_HOME and MPI_SHARED_HOME_PATH variables. Check the manual for details.

The fail-over method for copying the files is ssh (scp). This requires passwordless ssh to be configured between the nodes. If mpi-start is not able to log in to the remote host, it will display a failed to create directory on remote machine error message. Check that passwordless ssh is working between your nodes if you get that message.
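
Passwordless ssh can be tested without risking an interactive password prompt. The sketch below assumes an OpenSSH client, whose `BatchMode` and `ConnectTimeout` options make the login fail quickly instead of waiting for input. The helper names and the host name are placeholders.

```python
import subprocess


def ssh_check_command(host):
    """Build an ssh command that fails instead of prompting for a password."""
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"]


def passwordless_ssh_works(host):
    """Return True if a non-interactive login to `host` succeeds."""
    try:
        return subprocess.run(ssh_check_command(host)).returncode == 0
    except FileNotFoundError:  # no ssh client installed on this node
        return False


# "wn002.example.org" is a placeholder host name.
print(" ".join(ssh_check_command("wn002.example.org")))
```

Calling `passwordless_ssh_works` from one Worker Node for each of the others shows which pairs of nodes still need key-based login configured.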

eciencia: Middleware/MpiStart/TroubleshootingGuide (last edited 2011-09-20 07:40:38 by enol)