Most recent kernel where this bug did not occur: 2.6.11-ac7
Distribution: Tested on Scientific Linux 3.0.4 and 4.2 (RHEL compatible)
Hardware Environment: Tried on dual-core dual-CPU Opteron, dual Xeon, dual P3
Software Environment: The application has been compiled with the Intel C/C++/Fortran compilers, version 9

Problem Description:
Hello,
I am using the WRF model (http://www.wrf-model.org/index.php) with mpich 1.2.7 and the Intel Fortran/C/C++ 9 compilers (declare -x CC=icc,F90=ifort,CXX=icpc,RSH=ssh) on Scientific Linux 4.2 (RHEL compatible). When using more than one CPU, the model either crashes or produces wrong results; about once in 20 attempts it completes successfully. The issue occurs with vanilla 2.6 kernels (I also tested -mm kernels). With the RHEL-supplied kernels or the -ac series I have no problem.

Steps to reproduce:
Install mpich 1.2.7p1 on a RHEL 4 compatible system, compiled with Intel C/Fortran. You might also be able to run the binary with a gcc-compiled mpich (I tested that on Scientific Linux 3.0.5 with 10 Xeon CPUs and it worked).

Obtain http://tassadar.physics.auth.gr/~dzila/em_real.tar.bz2
It is the part of the model I use, already compiled with Intel C/Fortran.

Uncompress it, cd to em_real, run ulimit -s unlimited, and execute mpirun -np 4 ./wrf.exe (or -np 2, etc.). You might need to supply a text file containing the hostname of your system and feed it to mpich with -machinefile ./box.txt.

If a crash occurs after some time, or a fort.98 file with errors is created, or the message WOULD GO OFF appears in the rsl.* log files, then the problem has occurred.

If you want to compile the model on your own, then:
a) Install netcdf from http://www.unidata.ucar.edu/software/netcdf/ (compile with Intel C/Fortran). Just declare -x CC=icc,F90=ifort etc., then configure, make, make install.
b) Install WRF 2.1.2, again with Intel C/Fortran. Use option 11 from the configure, then execute compile em_real (this takes more than 1 hour).
In the directory test/em_real, copy all files from my tarball apart from the executables. Run as I mentioned before.

The tarball I provide has been executed successfully with up to 10 Xeon CPUs on RHEL kernels with a gcc-compiled mpich. WRF needs a commercial Fortran compiler installed, and netcdf must be compiled with the same compiler. I have also reported the problem to Intel and they are investigating.
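The failure indicators named in the report (a fort.98 file with errors, or the message WOULD GO OFF in the rsl.* log files) could be checked after each run with a small shell helper. This is only a sketch: the function name wrf_run_failed is hypothetical, and treating any non-empty fort.98 as an error is an assumption, since the report does not say what the file contains. A crash of mpirun itself is not detected here.

```shell
#!/bin/sh
# Hypothetical helper (not part of WRF or mpich): look for the failure
# indicators described in the report inside a run directory.
# Returns 0 if an indicator is found, 1 if none is found.
wrf_run_failed() {
    dir="${1:-.}"
    failed=1

    # Indicator: a fort.98 file exists and is non-empty
    # (assumption: any content in it means errors were written).
    if [ -s "$dir/fort.98" ]; then
        echo "fort.98 present in $dir"
        failed=0
    fi

    # Indicator: the message WOULD GO OFF appears in any rsl.* log file.
    for f in "$dir"/rsl.*; do
        [ -f "$f" ] || continue
        if grep -q "WOULD GO OFF" "$f"; then
            echo "WOULD GO OFF found in $f"
            failed=0
        fi
    done

    return "$failed"
}
```

After running mpirun -np 4 ./wrf.exe in em_real, calling wrf_run_failed . && echo "problem reproduced" would flag a bad run; repeating this over many runs gives a rough failure rate per kernel.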
Any updates on this problem? Does it still exist with newer kernels? If so, please provide more information on the crash (traces, error messages). Then we can categorize the problem appropriately. Thanks.
Please reopen this bug if
- it is still present with kernel 2.6.22 and
- you can provide the requested information.
Confirmed on 2.6.22.9:

mpirun -np 2 ./wrf.exe
starting wrf task 0 of 2
starting wrf task 1 of 2
Killed by signal 2.
forrtl: error (69): process interrupted (SIGINT)

I would need your help with how to provide the additional data requested. Sorry for the late reply; I am serving in the army for the next 7 months.
Created attachment 13031: error log