Bug 16277

Summary: Open mode not correctly preserved on nfsd in some rare cases
Product: File System
Component: NFS
Assignee: bfields
Reporter: Pierre Bourdon (pbourdon)
CC: matt, trondmy
Status: RESOLVED CODE_FIX
Severity: normal
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.35-rc3
Subsystem:
Regression: No
Bisected commit-id:
Attachments: bzip2 of a tcpdump when executing the testcase

Description Pierre Bourdon 2010-06-23 14:36:51 UTC
In some rare cases, the open mode of a file descriptor is overwritten on the NFS server, which leaves a client process hung and consumes a large amount of network bandwidth.

I have a program reproducing the bug on an NFSv4 mount. Mount the NFSv4 export on /nfs, create a file named /nfs/testfile (content does not matter, it can be empty) and run this program:

#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(void)
{
	/* Whole-file locks: l_type, l_whence, l_start, l_len, l_pid. */
	struct flock rdlock = { F_RDLCK, SEEK_SET, 0, 0, 0 };
	struct flock wrlock = { F_WRLCK, SEEK_SET, 0, 0, 0 };
	char buff[4096] = { 0 };
	int fd;

	/* Open read-only and take a read lock... */
	fd = open("/nfs/testfile", O_RDONLY);
	fcntl(fd, F_SETLK, &rdlock);

	/* ...then reopen read-write, take a write lock and write: the WRITE
	 * is rejected by the server with NFS4ERR_OPENMODE and close() hangs. */
	fd = open("/nfs/testfile", O_RDWR);
	fcntl(fd, F_SETLK, &wrlock);
	write(fd, buff, sizeof buff);
	close(fd);
	return 0;
}

Another way to reproduce this bug is to branch a bazaar repository onto an NFS mount and then run "bzr status" (that's how I found this bug in the first place).

When this program is run, it hangs and becomes unkillable (it does not react to SIGINT or SIGKILL). strace shows that it is stuck in the close() syscall, but the problem comes from the write() before it.

Using tcpdump, I captured the NFS traffic between client and server. The client sends the WRITE command in a loop and the server answers each time with NFS4ERR_OPENMODE. This error should not happen. Moreover, if an error does happen, the client should return an error to the calling process instead of becoming unkillable and saturating the bandwidth with retried NFS commands.
Comment 1 Trond Myklebust 2010-06-23 17:25:07 UTC
Reassigning to Bruce, since this appears at first sight to be a server issue.

Can you please post the binary tcpdump as an attachment? I'm curious to see what the server is replying to the second lock request there, and also whether or not the client is sending the correct stateid.

Thanks!
  Trond
Comment 2 Pierre Bourdon 2010-06-24 10:00:36 UTC
Hi Trond,

I'm attaching the bzip2 of the tcpdump.

Thanks,
Pierre
Comment 3 Pierre Bourdon 2010-06-24 10:01:30 UTC
Created attachment 26927 [details]
bzip2 of a tcpdump when executing the testcase
Comment 4 Trond Myklebust 2010-06-24 12:34:00 UTC
Thanks Pierre! This does indeed look like a server bug.

As far as I can see from your tcpdump, the client is doing the right thing, first upgrading the OPEN stateid to a read/write stateid, then upgrading the LOCK stateid to a WRITE lock, before using the latter in the WRITE request.

I'm therefore assuming that Bruce will deal with this...
Comment 5 Pierre Bourdon 2010-06-24 13:12:00 UTC
By the way, should the client really retry forever while keeping the process uninterruptible when the server fails? This is a problem in itself: the constant retries use all the available bandwidth and may also overload the NFS server.

If you think this is indeed a bug, I can open another bug report if you like.
Comment 6 Trond Myklebust 2010-06-24 13:37:56 UTC
Yes. Normally, if the NFS client gets NFS4ERR_OPENMODE, it should attempt to recover that stateid before retrying the request.

As far as I can see from the tcpdump, the state manager is indeed being invoked correctly, but after testing the lease validity, it isn't trying to replay the OPEN and LOCK. I'll try to look into why...
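
For illustration, here is a minimal userspace-flavoured sketch of that recovery flow (the type and helper names are stand-ins, not actual Linux NFS client code): on NFS4ERR_OPENMODE the client should re-establish the open and lock state and retry once, rather than resending the same WRITE forever.

#include <stddef.h>
#include <stdio.h>

#define NFS4_OK          0
#define NFS4ERR_OPENMODE 10038   /* error value from RFC 3530 */

struct nfs4_state { int open_mode; };   /* stand-in for the real client state */

/* Stubs standing in for the real RPC calls. */
static int do_write(struct nfs4_state *s, const void *buf, size_t len)
{ (void)buf; (void)len; return s->open_mode ? NFS4_OK : NFS4ERR_OPENMODE; }
static int recover_open_and_lock(struct nfs4_state *s)
{ s->open_mode = 1; return NFS4_OK; }

/* On NFS4ERR_OPENMODE: recover the stateids (replay OPEN and LOCK), then
 * retry once.  If recovery fails, report the error instead of looping. */
static int write_with_recovery(struct nfs4_state *s, const void *buf, size_t len)
{
	int status = do_write(s, buf, len);
	if (status == NFS4ERR_OPENMODE) {
		status = recover_open_and_lock(s);
		if (status != NFS4_OK)
			return status;
		status = do_write(s, buf, len);
	}
	return status;
}

int main(void)
{
	struct nfs4_state s = { 0 };
	char buf[16] = { 0 };
	printf("write status: %d\n", write_with_recovery(&s, buf, sizeof buf));
	return 0;
}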
Comment 7 bfields 2010-06-24 16:21:19 UTC
Yes, this is a known problem (though the testcase is helpful, thanks).  I'll see if I can get out a test patch tomorrow.

(The server has always attempted to reuse the old file descriptor on an open upgrade, doing a poor imitation of an open upgrade with get_write_access(), etc.--but this is fragile.  I've been meaning for a long time to replace the single fd in the open stateid by a set of (each possibly NULL) read/write/read-write fd's.  It should be easy....)
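
A rough userspace-flavoured sketch of that idea (the names are invented for illustration, with plain ints standing in for the kernel's struct file pointers): one slot per access mode instead of a single reused descriptor, where a read-write descriptor can satisfy either kind of access.

#include <stdio.h>

enum access_mode { ACC_READ, ACC_WRITE, ACC_RDWR, ACC_MODES };

struct open_fds {
	int fd[ACC_MODES];   /* -1 means "no descriptor for this mode yet" */
};

/* Return a descriptor usable for the requested access, or -1 if the open
 * stateid holds nothing suitable (a read-write fd satisfies either mode). */
static int fd_for_access(const struct open_fds *o, enum access_mode want)
{
	if (o->fd[want] != -1)
		return o->fd[want];
	if (want != ACC_RDWR && o->fd[ACC_RDWR] != -1)
		return o->fd[ACC_RDWR];
	return -1;
}

int main(void)
{
	struct open_fds o = { { -1, -1, -1 } };
	o.fd[ACC_RDWR] = 7;   /* pretend an O_RDWR open supplied this fd */
	printf("fd usable for WRITE: %d\n", fd_for_access(&o, ACC_WRITE));
	return 0;
}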
Comment 8 Pierre Bourdon 2010-07-01 17:49:36 UTC
Any news on this issue? :)
Comment 9 Pierre Bourdon 2010-07-02 15:29:02 UTC
Hi,

Trond, do you know if there is a way to kill processes blocked by this bug? Stopping the NFS server is an option we could afford if it helps kill the rogue process.

Currently, we are forced to reboot our server whenever someone uses "bzr status". This is obviously a huge problem.

Regards,
Comment 10 bfields 2010-07-07 16:20:31 UTC
Looking at this....  I'm trying to figure out which open mode to check against exactly: if an open is upgraded or downgraded after the initial lock, and a new lock is done using the lock stateid acquired before the upgrade/downgrade, what should happen?

The client expects it to use the upgraded/downgraded open, so I guess the server should use the latest file mode (/file descriptors) associated with the open stateid from which the lock stateid was originally derived.  I'm not completely sure of that, though.
Comment 11 Trond Myklebust 2010-07-07 17:20:00 UTC
The spec just says that "The client attempted a READ, WRITE, LOCK or SETATTR operation not sanctioned by the stateid passed". When we pass a lock stateid, then that does not carry an open access mode: the open access mode is carried by the open stateid. I'd therefore argue that the server needs to look up the open stateid and check the access modes on that...
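
A small stand-alone sketch of that check (the constants and structures are stand-ins, not nfsd code): when the request carries a lock stateid, follow it back to the open stateid it was derived from and test that stateid's share access against the operation.

#include <stdio.h>

#define SHARE_ACCESS_READ  1
#define SHARE_ACCESS_WRITE 2

struct open_stateid { int share_access; };
struct lock_stateid { struct open_stateid *open; };   /* derived-from link */

/* A WRITE needs write access on the open stateid the lock stateid points
 * back to; the lock stateid itself carries no access mode. */
static int check_openmode_for_write(const struct lock_stateid *ls)
{
	return (ls->open->share_access & SHARE_ACCESS_WRITE) ? 0 : -1;
}

int main(void)
{
	struct open_stateid os = { SHARE_ACCESS_READ | SHARE_ACCESS_WRITE };
	struct lock_stateid ls = { &os };
	printf("WRITE allowed: %s\n",
	       check_openmode_for_write(&ls) == 0 ? "yes" : "no (NFS4ERR_OPENMODE)");
	return 0;
}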

BTW: I'm also wondering if there is an omission in the spec: as far as I can see, OPEN_DOWNGRADE needs to be able to return NFS4ERR_LOCKS_HELD.
Comment 12 bfields 2010-07-07 18:02:44 UTC
"I'd therefore argue that the server needs to look up the
open stateid and check the access modes on that..."

OK.  I don't think I *need* to check the fd open mode against the lock type--I figure that check can safely be left to clients, if they care.  So all I care about is ensuring that the server doesn't unnecessarily reject locks.

So for this to work I think I just need to assume that a client doesn't expect to be able to, say, take an exclusive lock with a lock stateid derived from an open stateid other than the one which got the file open.

That sounds like a crazy thing to do, but as far as I can tell the open owner is meaningless in the spec except for sequencing, so perhaps it's legit.

Another alternative would be for the server to just keep e.g. a single WRONLY file descriptor for a given inode regardless of how many times it's opened.  Then as long as somebody has the file open for write (and the locker has adequate permissions) the exclusive lock will succeed.

I don't know if there's some risk to sharing the file descriptors in that way.
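
A sketch of that alternative, again with invented names: at most one descriptor per (inode, mode) pair, shared by every open of that mode, so an exclusive lock succeeds as long as somebody has the file open for write.

#include <stdio.h>

enum mode { MODE_RD, MODE_WR, MODE_MAX };

struct inode_fds {
	int fd[MODE_MAX];        /* at most one shared fd per mode, -1 if none */
	int refcount[MODE_MAX];  /* how many opens currently share it */
};

/* Every open of this inode with the same mode gets the same descriptor;
 * only the first opener actually contributes an fd (a real implementation
 * would close the redundant one). */
static int get_shared_fd(struct inode_fds *ino, enum mode m, int opener_fd)
{
	if (ino->fd[m] == -1)
		ino->fd[m] = opener_fd;
	ino->refcount[m]++;
	return ino->fd[m];
}

int main(void)
{
	struct inode_fds ino = { { -1, -1 }, { 0, 0 } };
	printf("first write open  -> fd %d\n", get_shared_fd(&ino, MODE_WR, 5));
	printf("second write open -> fd %d\n", get_shared_fd(&ino, MODE_WR, 9));
	return 0;
}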

"as far as I can see, OPEN_DOWNGRADE needs to be able to return NFS4ERR_LOCKS_HELD."

Hm.  Probably.  But aren't most unix-y servers going to just destroy locks automatically on close, and most windows-y servers going to allow exclusive locks on files open only for read?  In which case the situation may never arise.
Comment 13 bfields 2010-08-01 17:49:44 UTC
The test program no longer hangs for me after the patches posted to linux-nfs: http://thread.gmane.org/gmane.linux.nfs/34027/focus=34048
Comment 14 Pierre Bourdon 2010-08-02 11:25:29 UTC
(Bugzilla is down here, replying by email)

It looks like these patches fixed the problem. I tested with both the simple testcase and "bzr status" on NFS, and both now work fine.

Thanks a lot!

Regards,
Comment 15 bfields 2010-08-02 14:27:19 UTC
Thanks for the confirmation.
Comment 16 bfields 2010-08-21 13:34:59 UTC
*** Bug 16671 has been marked as a duplicate of this bug. ***