Bug 9601

Summary: Stale fcntl locks left in place on NFS server
Product: File System
Component: NFS
Status: CLOSED OBSOLETE
Severity: normal
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.24-rc5-c63a11903
Subsystem:
Regression: No
Bisected commit-id:
Reporter: Matthew Gregan [:kinetik] (kinetik)
Assignee: Trond Myklebust (trondmy)
CC: alan, bfields, brendan
Attachments: testcase
             server's nlm_debug output (trimmed)
             client's nlm_debug output (trimmed)

Description Matthew Gregan [:kinetik] 2007-12-18 21:04:32 UTC
Most recent kernel where this bug did not occur: unknown

Distribution: Fedora 8 x86_64

Hardware Environment: x86_64 kernel running in VMWare Fusion with a single CPU configured (real hardware is Core 2 Duo, but probably not relevant)

Software Environment: nfs-utils-1.1.0-6.fc8, nfs-utils-lib-1.1.0-3.fc8

Problem Description: Users reported problems with fcntl() locking of their Firefox profile when home directory (and thus profile) was located on NFS and client was running Linux (mozilla.org bug #318801).  Reports indicated that fcntl() lock attempts would fail in cases where the user was sure the profile should not be locked (i.e. no running Firefox processes with access to that profile).

Investigating this with two Fedora 8 VMs, one acting as NFS server and the other as client (originally reproduced with Fedora kernel 2.6.23.6-64.fc8), I can easily reproduce a situation where stale fcntl() locks are left on the server.

The testcase attempts to lock a file using fcntl(.., F_SETLK, ...) and exits with rv == 0 if the lock was obtained, rv == 1 if the lock was busy, and rv == 2 on any other error.  The lock is never explicitly unlocked--it is left up to the kernel/NLM to clean up the lock when the process dies.

Steps to reproduce:

0. Check that lockd/rpc.statd are running.
1. Export directory on server.
2. Mount exported directory on client (no special mount options, defaults to NFSv3).
3. Change directory into exported dir on the client.
4. On the client, execute
   while :; do ./a.out; echo $?; done;
5. On the client, execute
   while :; do pkill -9 a.out; sleep 0.01; done

Within a short time (usually under thirty seconds, in my case) the lock acquiring loop started at step #4 will continuously fail because fcntl() returns -1/EAGAIN.

This is also reproducible using F_SETLKW, except that once the problem occurs the locker blocks indefinitely waiting for the lock.

Once this occurs, attempts to lock the file with fcntl() on the client or the server fail with -1/EAGAIN.  The lock remains listed in /proc/locks on the server and is never removed:

1: POSIX  ADVISORY  WRITE  128  fd:00:4002402 0 EOF

% ls -i lock-new
4002402 lock-new

I haven't been able to reproduce when a single machine was acting as both the NFS server and client.

Attachments including nlm_debug log excerpts and the testcase coming up.

Comment 1 Matthew Gregan [:kinetik] 2007-12-18 21:05:11 UTC
Created attachment 14116
testcase

Comment 2 Matthew Gregan [:kinetik] 2007-12-18 21:05:54 UTC
Created attachment 14117
server's nlm_debug output (trimmed)

Comment 3 Matthew Gregan [:kinetik] 2007-12-18 21:06:13 UTC
Created attachment 14118
client's nlm_debug output (trimmed)

Comment 4 Matthew Gregan [:kinetik] 2007-12-18 21:07:02 UTC
Maybe related to https://bugzilla.redhat.com/show_bug.cgi?id=229469