Most recent kernel where this bug did *NOT* occur: 2.6.20.1
Distribution: Suse 10.1
Hardware Environment: Dell or PC with Intel Dual Core 64 Processors
Software Environment: kernel 2.6.20.1 with revised patch from http://bugzilla.kernel.org/show_bug.cgi?id=7916
Problem Description: File locking over NFS works, but takes about 30 seconds
Steps to reproduce: use this perl script:

    #!/usr/bin/perl
    use Fcntl ':flock'; # import LOCK_* constants
    print("Just before opening file.\n");
    open(LOCKFILE,">testlockfile") or die "Error: Could not write to lock file: testlockfile: $!\n";
    print("Just before locking file.\n");
    flock(LOCKFILE,LOCK_EX);
    print("Just before unlocking file.\n");
    flock(LOCKFILE,LOCK_UN);
    print("All done. File locked and unlocked.\n");
    unlink("testlockfile");
    exit(0);
Here is some more information about the problem:

We are running a Dell cluster and also other PCs with Intel Dual Core processors with a Suse 10.1 distribution. On all platforms we noticed the same behaviour: the kernel 2.6.18 in the Suse distribution produces wrong data on the disc when written over NFS. After updating to kernel 2.6.20.1 this problem was resolved, but under some NFS load a kernel bug (kernel BUG at fs/inode.c) regularly occurred, so we applied the revised patch from http://bugzilla.kernel.org/show_bug.cgi?id=7916

Now the system and NFS seem to run stably, but unfortunately file locking is extremely slow, so that applications run into timeouts and we receive the following kernel messages (from dmesg):

    ...
    statd: server localhost not responding, timed out
    lockd: cannot monitor HOSTNAME
    lockd: failed to monitor HOSTNAME

After some googling, I found the following perl script to test file locking:

    #!/usr/bin/perl
    use Fcntl ':flock'; # import LOCK_* constants
    print("Just before opening file.\n");
    open(LOCKFILE,">testlockfile") or die "Error: Could not write to lock file: testlockfile: $!\n";
    print("Just before locking file.\n");
    flock(LOCKFILE,LOCK_EX);
    print("Just before unlocking file.\n");
    flock(LOCKFILE,LOCK_UN);
    print("All done. File locked and unlocked.\n");
    unlink("testlockfile");
    exit(0);

Running this on an NFS-mounted file system gives the following result:

    time perlscript
    Just before opening file.
    Just before locking file.
    Just before unlocking file.
    All done. File locked and unlocked.

    real 0m30.068s
    user 0m0.012s
    sys 0m0.000s

So it takes about 30 seconds to do the file locking. On other machines running the older kernel 2.6.18 but with the same hardware and software configuration, I get the same messages, but the times were:

    real 0m0.297s
    user 0m0.028s
    sys 0m0.012s

Any hints as to what is going wrong? Here are the details of the system:

    uname -a
    Linux HOSTNAME 2.6.20.1 #4 SMP Tue Mar 6 11:57:25 CET 2007 x86_64 x86_64 x86_64 GNU/Linux
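The test script above only reports the total run time via "time", so it does not show which call is slow, and it ignores the return value of flock. A minimal sketch of a variant that times the lock and unlock calls separately and checks for failure might look like this. The helper name timed and the command-line path argument are assumptions added for illustration; they are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl ':flock';          # import LOCK_* constants
use Time::HiRes qw(time);    # sub-second resolution for time()

# Hypothetical helper: run a code reference and return
# (elapsed seconds, the code's return value).
sub timed {
    my ($code) = @_;
    my $start  = time();
    my $result = $code->();
    return (time() - $start, $result);
}

# Assumption: the file to lock is given as the first argument and
# lives on the NFS mount under test; defaults to the original name.
my $path = @ARGV ? $ARGV[0] : 'testlockfile';

open(my $fh, '>', $path)
    or die "Error: Could not write to lock file: $path: $!\n";

# Time the exclusive lock and fail loudly if it was not granted.
my ($lock_secs, $locked) = timed(sub { flock($fh, LOCK_EX) });
die "Error: flock(LOCK_EX) failed: $!\n" unless $locked;

# Time the unlock the same way.
my ($unlock_secs, $unlocked) = timed(sub { flock($fh, LOCK_UN) });
die "Error: flock(LOCK_UN) failed: $!\n" unless $unlocked;

close($fh);
unlink($path);

printf("lock:   %.3f s\nunlock: %.3f s\n", $lock_secs, $unlock_secs);
```

Run on both the NFS mount and a local directory, this should show whether the delay sits in the lock request itself rather than in opening or removing the file.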
I think what you're seeing is related to the kernel statd support in Suse (mea culpa). You upgraded to mainline, which doesn't have that (yet), so you also need to run the user space rpc.statd (which Suse doesn't ship). You probably need to get it from nfs-utils and build and install it yourself.
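One way to check whether a statd is actually reachable on the client is to look for the NSM "status" service (RPC program number 100024) in the portmapper listing. The following is only a sketch: it assumes the rpcinfo tool is installed and that "rpcinfo -p" prints one service per line with the program number in the first column. The function name statd_registered is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return 1 if the given "rpcinfo -p" style text
# lists RPC program 100024 (the "status" service that rpc.statd
# registers), 0 otherwise.
sub statd_registered {
    my ($rpcinfo_output) = @_;
    for my $line (split /\n/, $rpcinfo_output) {
        my ($program) = split ' ', $line;
        # Skip the header line and anything that is not a program number.
        next unless defined $program && $program =~ /^\d+$/;
        return 1 if $program == 100024;
    }
    return 0;
}

# Assumption: the host to query is the first argument, else localhost.
my $host   = @ARGV ? $ARGV[0] : 'localhost';
my $output = '';

# List-form pipe open avoids passing $host through a shell.
if (open(my $rpc, '-|', 'rpcinfo', '-p', $host)) {
    local $/;                      # slurp the whole listing at once
    my $text = <$rpc>;
    $output = $text if defined $text;
    close($rpc);
} else {
    warn "Could not run rpcinfo: $!\n";
}

if (statd_registered($output)) {
    print "statd (program 100024) is registered on $host\n";
} else {
    print "statd (program 100024) is NOT registered on $host\n";
}
```

If the service is missing, lockd has nothing to talk to, which would be consistent with the "statd: server localhost not responding" messages in the report.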
So I installed:

    libgssapi-0.10.tar.gz
    librpcsecgss-0.14
    nfs-utils-1.0.10.tar.gz

After starting "statd" from nfs-utils the problem vanished, so this seems to be a Suse-related problem. Thank you, Olaf Kirch, for the fast response.

That said, I think Suse should have supplied a kernel update for the wrong data produced over NFS, which I mentioned in my problem description. That is really something which has caused quite some headache.
Closing bug, since this appears to be a SuSE distribution issue rather than a mainline kernel breakage.