Bug 9601 - Stale fcntl locks left in place on NFS server
Summary: Stale fcntl locks left in place on NFS server
Alias: None
Product: File System
Classification: Unclassified
Component: NFS
Hardware: All Linux
Importance: P1 normal
Assignee: Trond Myklebust
Depends on:
Reported: 2007-12-18 21:04 UTC by Matthew Gregan [:kinetik]
Modified: 2012-05-17 15:21 UTC
CC List: 3 users

See Also:
Kernel Version: 2.6.24-rc5-c63a11903
Tree: Mainline
Regression: No

testcase (711 bytes, text/plain)
2007-12-18 21:05 UTC, Matthew Gregan [:kinetik]
server's nlm_debug output (trimmed) (5.69 KB, text/plain)
2007-12-18 21:05 UTC, Matthew Gregan [:kinetik]
client's nlm_debug output (trimmed) (38.83 KB, text/plain)
2007-12-18 21:06 UTC, Matthew Gregan [:kinetik]

Description Matthew Gregan [:kinetik] 2007-12-18 21:04:32 UTC
Most recent kernel where this bug did not occur: unknown

Distribution: Fedora 8 x86_64

Hardware Environment: x86_64 kernel running in VMWare Fusion with a single CPU configured (real hardware is Core 2 Duo, but probably not relevant)

Software Environment: nfs-utils-1.1.0-6.fc8, nfs-utils-lib-1.1.0-3.fc8

Problem Description: Users reported problems with fcntl() locking of their Firefox profile when home directory (and thus profile) was located on NFS and client was running Linux (mozilla.org bug #318801).  Reports indicated that fcntl() lock attempts would fail in cases where the user was sure the profile should not be locked (i.e. no running Firefox processes with access to that profile).

Investigating this with two Fedora 8 VMs, one acting as NFS server and the other as client (originally reproduced with the Fedora kernel), I can easily reproduce a situation where stale fcntl() locks are left on the server.

The testcase attempts to lock a file using fcntl(.., F_SETLK, ...) and exits with rv == 0 if the lock was obtained, rv == 1 if the lock was busy, and rv == 2 on any other error.  The lock is never explicitly unlocked--it is left up to the kernel/NLM to clean up the lock when the process dies.
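A minimal sketch of a program matching that description follows. It is not the attached testcase: the helper name try_lock and the choice of a whole-file write lock are assumptions, the latter inferred from the WRITE / 0 EOF entry shown in /proc/locks later in this report.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from the attachment).
 * Returns 0 if the lock was obtained, 1 if it was busy, 2 on any other error.
 * On success the descriptor is deliberately left open and the lock is never
 * released, so cleanup is left to the kernel/NLM when the process dies.
 */
int try_lock(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 2;
    }

    struct flock fl = {0};
    fl.l_type = F_WRLCK;    /* exclusive (write) lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;           /* 0 means "through end of file" */

    if (fcntl(fd, F_SETLK, &fl) == 0)
        return 0;           /* lock held until the process exits */

    int err = errno;        /* save errno before close() can change it */
    close(fd);
    /* POSIX allows either EAGAIN or EACCES when the lock is held elsewhere. */
    return (err == EAGAIN || err == EACCES) ? 1 : 2;
}
```

A main() that simply returns try_lock("lock-new") would give the exit codes described above; swapping F_SETLK for F_SETLKW gives the blocking variant.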

Steps to reproduce:

0. Check that lockd/rpc.statd are running.
1. Export directory on server.
2. Mount exported directory on client (no special mount options, defaults to NFSv3).
3. Change directory into exported dir on the client.
4. On the client, execute
   while :; do ./a.out; echo $?; done;
5. On the client, execute
   while :; do pkill -9 a.out; sleep 0.01; done

Within a short time (usually under thirty seconds, in my case) the lock acquiring loop started at step #4 will continuously fail because fcntl() returns -1/EAGAIN.

This is also reproducible using F_SETLKW, except that once the problem occurs the locker blocks indefinitely waiting for the lock.

Once this occurs, attempts to lock the file with fcntl() on the client or the server fail with -1/EAGAIN.  The lock remains listed in /proc/locks on the server and is never removed:

1: POSIX  ADVISORY  WRITE  128  fd:00:4002402 0 EOF

% ls -i lock-new
4002402 lock-new
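The inode in the /proc/locks entry (the last field of fd:00:4002402) matches the ls -i output, which is how the stale lock is tied to lock-new. As a rough sketch, that comparison could be automated by parsing each line. The function name parse_lock_line is hypothetical, and the format string assumes only the line layout shown above (id, class, type, access, pid, major:minor:inode in hex:hex:decimal, start, end); other layouts are not handled.

```c
#include <stdio.h>

/*
 * Hypothetical helper: extract the holder pid and the inode number from one
 * /proc/locks line of the form shown in this report. Returns 0 on success,
 * -1 if the line does not match that layout.
 */
int parse_lock_line(const char *line, int *pid, unsigned long *inode)
{
    unsigned int major, minor;

    /* "%*..." fields are matched but not stored. */
    int n = sscanf(line, "%*d: %*s %*s %*s %d %x:%x:%lu",
                   pid, &major, &minor, inode);
    return (n == 4) ? 0 : -1;
}
```

Comparing the returned inode against st_ino from stat() on the file would then identify which entries refer to it.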

I haven't been able to reproduce this when a single machine acts as both the NFS server and the client.

Attachments including nlm_debug log excerpts and the testcase coming up.
Comment 1 Matthew Gregan [:kinetik] 2007-12-18 21:05:11 UTC
Created attachment 14116
testcase
Comment 2 Matthew Gregan [:kinetik] 2007-12-18 21:05:54 UTC
Created attachment 14117
server's nlm_debug output (trimmed)
Comment 3 Matthew Gregan [:kinetik] 2007-12-18 21:06:13 UTC
Created attachment 14118
client's nlm_debug output (trimmed)
Comment 4 Matthew Gregan [:kinetik] 2007-12-18 21:07:02 UTC
Maybe related to https://bugzilla.redhat.com/show_bug.cgi?id=229469
