Created attachment 38442 [details]
Programs for reproducing
Debian Lenny with kernel 2.6.37-rc3 i686
172.22.0.10 is a server
172.22.0.11 is a client
/etc/exports on a server:
Steps to reproduce:
client# mount -t nfs 172.22.0.10:/srv/test /srv/test
server$ cd /srv/test
server$ ./linlock r file
client$ cd /srv/test
client$ ./linlock w file
fcntl: Resource temporarily unavailable
client$ ./lockinfo file
F_UNLCK SEEK_SET start 0 len 0
The ./linlock tool sets a read (r) or write (w) lock on a file; ./lockinfo shows what F_GETLK returns. This test shows that the client cannot set a write lock while the server holds a read lock, yet F_GETLK on the client reports that the file is not locked.
The bug does not occur if the client runs a 2.6.21 kernel.
The bug occurs if the client runs 2.6.22-rc1 or later.
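The attached programs are not reproduced in this thread, so the following is only a minimal sketch of what the two tools presumably do, assuming whole-file locks as suggested by the "start 0 len 0" output. The names set_lock, query_lock and lock_type_name are hypothetical and are not taken from the attachment.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for ./linlock: try to place a whole-file lock
 * without blocking. type is F_RDLCK or F_WRLCK.
 * Returns 0 on success, -1 on failure with errno set
 * (EAGAIN or EACCES when a conflicting lock exists). */
static int set_lock(int fd, short type)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "up to end of file" */
    return fcntl(fd, F_SETLK, &fl);
}

/* Hypothetical stand-in for ./lockinfo: ask whether a whole-file write
 * lock could be placed. On return, out->l_type is F_UNLCK if nothing
 * conflicts; otherwise *out describes one conflicting lock. */
static int query_lock(int fd, struct flock *out)
{
    memset(out, 0, sizeof(*out));
    out->l_type = F_WRLCK;
    out->l_whence = SEEK_SET;
    out->l_start = 0;
    out->l_len = 0;
    return fcntl(fd, F_GETLK, out);
}

/* Map l_type to the names printed in the report above. */
static const char *lock_type_name(short type)
{
    switch (type) {
    case F_RDLCK: return "F_RDLCK";
    case F_WRLCK: return "F_WRLCK";
    case F_UNLCK: return "F_UNLCK";
    default:      return "unknown";
    }
}
```

With these helpers, the inconsistency in the report is set_lock(fd, F_WRLCK) failing on the client while query_lock() on the same file still yields F_UNLCK. Note that F_GETLK never reports locks held by the calling process itself, so the query has to come from a different process (here, a different machine) than the lock holder.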
Created attachment 38452 [details]
[PATCH] NFS: Fix fcntl F_GETLK not reporting some conflicts
Narrowing the search range to 2.6.21..2.6.22-rc1 was very helpful. Please try this patch (for the current HEAD, but applies even to 2.6.22-rc1 with some fuzz).
Thank you for the fast fix. With this patch, F_GETLK reports that the file is locked.
I also found another difference between the client and the server. If a lock is set on the client, F_GETLK on the server reports the pid of the process that set the lock (the [lockd] pid). But if a lock is set on the server, F_GETLK on the client returns a pid equal to 0. This difference seems older than the lock type bug (has it always existed?), because it is also present on 2.6.21 and 2.6.18.
Could this be the cause of the problem? If so, it looks like it should be replaced with
"fl->fl_pid = req->a_res.lock.fl.fl_pid;", but of course there may be reasons not to do that (setting it to zero looks very strange).
(In reply to comment #3)
Here req->a_res.lock.fl.fl_pid does not really contain a pid. According to the NFS lock manager protocol specs, it is just a unique value generated by the system which requested the lock. In Linux this is an arbitrary number not actually related to a real pid:
And a real pid value would be useless anyway, because there is no standard field in struct flock to return any identification of the machine where that process is running (some systems have such fields, but not Linux). Even worse, returning a remote pid value which happens to match some unrelated local process may be harmful - returning 0 meaning "a remote process which cannot be identified in a reliable and portable way" is much safer.
Thanks for your comment! Now it's clear to me.
(In reply to comment #4)
When a lock is set on a client and fcntl is called on the server, F_GETLK returns the lockd pid instead of 0. So pid != 0 does not always mean that the lock was set by a local program.
On 2.6.8, F_GETLK always returns the lockd pid for a remote lock. But now we get either 0 or the lockd pid.
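To summarize the pid discussion from an application's point of view, here is a small sketch of how an F_GETLK result could be interpreted on the kernels discussed here. It only encodes the behaviour described in this thread; the enum and the function classify_lock_owner are hypothetical names, not an existing API.

```c
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical classification of what F_GETLK says about the lock owner. */
enum lock_owner {
    LOCK_OWNER_NONE,          /* l_type == F_UNLCK: no conflicting lock */
    LOCK_OWNER_UNIDENTIFIED,  /* l_pid == 0: remote owner, no usable pid */
    LOCK_OWNER_PID_REPORTED   /* l_pid != 0: may be a local process, or
                                 lockd acting for a remote NFS client */
};

/* fl must have been filled in by a successful fcntl(fd, F_GETLK, fl). */
static enum lock_owner classify_lock_owner(const struct flock *fl)
{
    if (fl->l_type == F_UNLCK)
        return LOCK_OWNER_NONE;
    if (fl->l_pid == 0)
        return LOCK_OWNER_UNIDENTIFIED;
    /* A non-zero pid is not proof of a local owner: on an NFS server
     * it can be the lockd pid, as noted above. */
    return LOCK_OWNER_PID_REPORTED;
}
```

The point of keeping three cases is that only LOCK_OWNER_NONE is unambiguous. A caller that wants to signal or inspect the reported pid would still need some other way to confirm that it is the actual lock holder.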
fixed in 2.6.37-rc6 by
Author: Sergey Vlasov <firstname.lastname@example.org>
Date: Sun Nov 28 21:04:05 2010 +0000
NFS: Fix fcntl F_GETLK not reporting some conflicts