Most recent kernel where this bug did not occur: unknown
Distribution: Fedora 8 x86_64
Hardware Environment: x86_64 kernel running in VMWare Fusion with a single CPU configured (real hardware is Core 2 Duo, but probably not relevant)
Software Environment: nfs-utils-1.1.0-6.fc8, nfs-utils-lib-1.1.0-3.fc8
Problem Description: Users reported problems with fcntl() locking of their Firefox profile when the home directory (and thus the profile) was located on NFS and the client was running Linux (mozilla.org bug #318801). Reports indicated that fcntl() lock attempts would fail in cases where the user was sure the profile should not be locked (i.e. no running Firefox processes with access to that profile).
Investigating this with two Fedora 8 VMs (and originally reproduced with Fedora kernel 126.96.36.199-64.fc8), one acting as NFS server and the other as client, I can easily reproduce a situation in which stale fcntl() locks are left on the server.
The testcase attempts to lock a file using fcntl(.., F_SETLK, ...) and exits with rv == 0 if the lock was obtained, rv == 1 if the lock was busy, and rv == 2 on any other error. The lock is never explicitly unlocked--it is left up to the kernel/NLM to clean up the lock when the process dies.
Steps to reproduce:
0. Check that lockd/rpc.statd are running.
1. Export directory on server.
2. Mount exported directory on client (no special mount options, defaults to NFSv3).
3. Change directory into exported dir on the client.
4. On the client, execute
while :; do ./a.out; echo $?; done;
5. On the client, execute
while :; do pkill -9 a.out; sleep 0.01; done
Within a short time (usually under thirty seconds, in my case) the lock acquiring loop started at step #4 will continuously fail because fcntl() returns -1/EAGAIN.
This is also reproducible using F_SETLKW, except that the locker will block indefinitely on the lock once the problem occurs.
Once this occurs, attempts to lock the file with fcntl() on the client or the server fail with -1/EAGAIN. The lock remains listed in /proc/locks on the server and is never removed:
1: POSIX ADVISORY WRITE 128 fd:00:4002402 0 EOF
% ls -i lock-new
I haven't been able to reproduce this when a single machine acts as both the NFS server and the client.
Attachments including nlm_debug log excerpts and the testcase coming up.
Created attachment 14116
Created attachment 14117
server's nlm_debug output (trimmed)
Created attachment 14118
client's nlm_debug output (trimmed)
Maybe related to https://bugzilla.redhat.com/show_bug.cgi?id=229469