Bug 6466 - MPICH applications crash or produce wrong results with vanilla kernels, work fine with 2.6 -ac series or RHEL and compatible kernels
Status: CLOSED OBSOLETE
Alias: None
Product: Other
Classification: Unclassified
Component: Other
Hardware: i386 Linux
Importance: P2 normal
Assignee: other_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-04-30 07:03 UTC by Dimitris Zilaskos
Modified: 2012-05-12 01:36 UTC

See Also:
Kernel Version: 2.6.16, 2.6.15.6, 2.6.14.3, 2.6.9, 2.6.9-mm1
Subsystem:
Regression: No
Bisected commit-id:


Attachments
error log (143.06 KB, application/octet-stream)
2007-10-03 07:33 UTC, Dimitris Zilaskos

Description Dimitris Zilaskos 2006-04-30 07:03:46 UTC
Most recent kernel where this bug did not occur: 2.6.11-ac7

Distribution: Tested Scientific Linux 3.0.4 and 4.2 (RHEL compatible)

Hardware Environment: Tried on dual-core dual-CPU Opteron, dual Xeon, dual P3

Software Environment: The application has been compiled with Intel
C/C++/Fortran compilers version 9

Problem Description:

Hello,

I am using the WRF model (http://www.wrf-model.org/index.php) with mpich1.7 and
the Intel Fortran/C/C++ 9 compilers (declare -x CC=icc,F90=ifort,CXX=icpc,RSH=ssh)
on Scientific Linux 4.2 (RHEL compatible). When using more than 1 CPU, the model
either crashes or produces wrong results; only about once in 20 attempts does it
complete successfully.
The issue occurs with vanilla 2.6 kernels (mm kernels were also tested). With
RHEL-supplied kernels or the -ac series I have no problem.

Steps to reproduce:

Install mpich 1.2.7p1 on a RHEL 4 compatible system. Compile it with Intel
C/Fortran. You might also be able to run the binary with gcc-compiled mpich (I
tested that on Scientific Linux 3.0.5 with 10 Xeon CPUs and it worked).
Obtain http://tassadar.physics.auth.gr/~dzila/em_real.tar.bz2

It is the part of the model I use, already compiled with Intel C/Fortran.

Uncompress it, cd to em_real, run ulimit -s unlimited and execute mpirun -np 4
./wrf.exe (or -np 2 etc). You might need to supply a text file containing the
hostname of your system and feed it to mpich with -machinefile ./box.txt. If a
crash occurs after some time, or a fort.98 file with errors is created, or the
message WOULD GO OFF appears in the rsl.* log files, then the problem has occurred.
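
The run-and-check steps above could be scripted roughly as follows. This is a minimal sketch and not part of the original report: the function names run_wrf_once and wrf_run_failed are made up for illustration, and treating the mere presence of fort.98 as a failure sign is an assumption (the report speaks of a fort.98 file "with errors"). A crash of wrf.exe itself is not detected here and still has to be spotted from the mpirun output.

```shell
#!/bin/bash
# Hypothetical helpers; names are illustrative, not from the report.

# Run the case once from inside the em_real directory.
# All arguments are handed to mpirun, e.g. -np 4 -machinefile ./box.txt
run_wrf_once() {
    # Subshell so the raised stack limit does not leak into the caller.
    ( ulimit -s unlimited && mpirun "$@" ./wrf.exe )
}

# Return 0 (true) if directory $1 shows one of the reported failure signs.
wrf_run_failed() {
    local dir="${1:-.}"
    # Assumption: any fort.98 file counts as a failure sign.
    if [ -e "$dir/fort.98" ]; then
        return 0
    fi
    # The message WOULD GO OFF in any rsl.* log file.
    if grep -qs "WOULD GO OFF" "$dir"/rsl.*; then
        return 0
    fi
    return 1
}
```

A possible use from inside em_real would be: run_wrf_once -np 4 -machinefile ./box.txt; wrf_run_failed . && echo "problem reproduced". Because the failure is intermittent, repeating this several times may be needed.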

If you want to compile the model on your own, then
a) Install netcdf from http://www.unidata.ucar.edu/software/netcdf/ (compile
with Intel C/Fortran).
Just declare -x CC=icc,F90=ifort etc..., configure, make, make install
b) Install WRF 2.1.2, again with Intel C/Fortran. Use option 11 from the configure.
Then execute compile em_real. (This takes more than 1 hour.)

In the directory test/em_real, copy all files from my tarball apart from the
executables.
Run as I mentioned before.
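
The compiler setup and netcdf build described above might look like this in bash. It is only a sketch under assumptions: the function names are hypothetical, the variable list is the one given in the report (CC, F90, CXX, RSH), and $1 is assumed to be the directory of the unpacked netcdf sources that contains the configure script. The WRF step (configure option 11, then compile em_real) is interactive and is left out.

```shell
#!/bin/bash
# Hypothetical helpers; names are illustrative, not from the report.

# Export the compiler variables named in the report, in valid bash syntax
# (space-separated assignments rather than a comma-separated list).
use_intel_compilers() {
    export CC=icc CXX=icpc F90=ifort RSH=ssh
}

# Configure, build and install netcdf from the source directory given as $1.
# Runs in a subshell so the caller's working directory is unchanged.
build_netcdf() {
    ( cd "$1" && ./configure && make && make install )
}
```

For example: use_intel_compilers; build_netcdf /path/to/netcdf/source (the path is a placeholder).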

The tarball I provide has been executed successfully with up to 10 Xeon CPUs on
RHEL kernels with gcc-compiled mpich. WRF needs a commercial Fortran compiler
installed, and netcdf must be compiled with the same compiler.

I have also reported the problem to Intel and they are investigating.
Comment 1 Natalie Protasevich 2007-08-26 00:42:22 UTC
Any updates on this problem? Does it still exist with new kernels?
If so, you should provide more information on the crash (traces, error messages). Then we can categorize the problem appropriately.
Thanks.
Comment 2 Adrian Bunk 2007-09-20 17:11:16 UTC
Please reopen this bug if
- it is still present with kernel 2.6.22 and
- you can provide the requested information.
Comment 3 Dimitris Zilaskos 2007-10-03 06:39:25 UTC
Confirmed on 2.6.22.9

mpirun -np 2 ./wrf.exe
 starting wrf task            0  of            2
 starting wrf task            1  of            2
Killed by signal 2.
forrtl: error (69): process interrupted (SIGINT)

I would need your help on how to provide the additional data requested.

Sorry for the late reply. I serve in the army for the next 7 months.
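
One possible way to gather the kind of data asked for in comment 1 (error messages and logs) is sketched below. This is an illustration only, not something posted in the thread: the function name collect_wrf_diag and the output file names are made up, and which data the kernel developers actually need is not stated in the report.

```shell
#!/bin/bash
# Hypothetical helper; name and output layout are illustrative.

# Collect basic diagnostics from run directory $1 into output directory $2.
collect_wrf_diag() {
    local run_dir="${1:-.}" out_dir="${2:-./diag}"
    mkdir -p "$out_dir" || return 1
    # Kernel version and architecture of the machine the run was made on.
    uname -a > "$out_dir/kernel.txt"
    # Recent kernel messages; dmesg may be restricted, so errors are ignored.
    dmesg 2>/dev/null | tail -n 200 > "$out_dir/dmesg_tail.txt"
    # Copy the WRF rsl.* logs and fort.98 if they exist.
    cp "$run_dir"/rsl.* "$run_dir"/fort.98 "$out_dir"/ 2>/dev/null
    return 0
}
```

Running collect_wrf_diag . ./diag in em_real after a failed run would leave the files in ./diag, which could then be packed and attached to the bug.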
Comment 4 Dimitris Zilaskos 2007-10-03 07:33:12 UTC
Created attachment 13031
error log
