Latest working kernel version: 2.6.24-rc6-git9
Earliest failing kernel version: 2.6.24-rc6-git10
Distribution: Debian GNU/Linux 4.0/Etch
Hardware Environment: x86_64, diskless workstation
Software Environment: NFS4(krb5i) /home of users
Problem Description:
I've noticed a bug in the commit that went into -rc6-git10, e6e21970baff4845de74584e2efc8c964a55d574 (NFSv4: Fix open_to_lock_owner sequenceid allocation...). Here, the users' /home is shared between machines via nfs4 mounts from the server. With this commit the system becomes unusable within ~10 minutes when running a file-hungry application such as firefox. The application seems to hang, as does every other process that accesses an nfs4 mount point. The behavior is reproducible here on at least two x86_64 boxes. I haven't tried an x86 box yet. Reverting this commit removes the described effect.
argh.
Created attachment 14366
nfsd4: fix bad seqid on lock request incompatible with open mode
As I said in my email to you about this subject, I'm interested in obtaining the sysrq-T trace when the hang occurs. That said, the attached patch fixes a known bug with sequence ids on the Linux server and might make a difference.
Created attachment 14368
SysRq-t dump

The dump is attached. It is rather large. I think the offending part looks like this:

Jan 7 22:57:52 narcissus kernel: firefox-bin S 0000000000000001 0 1714 1727
Jan 7 22:57:52 narcissus kernel: ffff810114c2dbd8 0000000000000046 ffff810114c2db78 ffffffff80256348
Jan 7 22:57:52 narcissus kernel: ffff810114c2a000 ffffffff805b5c8a ffff810114c2a000 ffff810122dd8000
Jan 7 22:57:52 narcissus kernel: ffff810114c2a218 0000000180256516 ffff810009857730 0000000000000292
Jan 7 22:57:52 narcissus kernel: Call Trace:
Jan 7 22:57:52 narcissus kernel: [<ffffffff80256348>] mark_held_locks+0x4a/0x6a
Jan 7 22:57:52 narcissus kernel: [<ffffffff805b5c8a>] _spin_unlock_irqrestore+0x3f/0x69
Jan 7 22:57:52 narcissus kernel: [<ffffffff805b5c97>] _spin_unlock_irqrestore+0x4c/0x69
Jan 7 22:57:52 narcissus kernel: [<ffffffff8059c81c>] rpc_wait_bit_interruptible+0x22/0x28
Jan 7 22:57:52 narcissus kernel: [<ffffffff805b38de>] __wait_on_bit+0x45/0x77
Jan 7 22:57:52 narcissus kernel: [<ffffffff8059c7fa>] rpc_wait_bit_interruptible+0x0/0x28
Jan 7 22:57:52 narcissus kernel: [<ffffffff8059c7fa>] rpc_wait_bit_interruptible+0x0/0x28
Jan 7 22:57:52 narcissus kernel: [<ffffffff805b397e>] out_of_line_wait_on_bit+0x6e/0x7b
Jan 7 22:57:52 narcissus kernel: [<ffffffff8024b194>] wake_bit_function+0x0/0x2a
Jan 7 22:57:52 narcissus kernel: [<ffffffff8059c7a0>] __rpc_wait_for_completion_task+0x3a/0x40
Jan 7 22:57:52 narcissus kernel: [<ffffffff80306987>] nfs4_wait_for_completion_rpc_task+0x2a/0x47
Jan 7 22:57:52 narcissus kernel: [<ffffffff80306c2b>] _nfs4_do_setlk+0x1a1/0x205
Jan 7 22:57:52 narcissus kernel: [<ffffffff803073f1>] nfs4_proc_lock+0x309/0x40b
Jan 7 22:57:52 narcissus kernel: [<ffffffff802f610d>] do_setlk+0x61/0xbb
Jan 7 22:57:52 narcissus kernel: [<ffffffff802f640d>] nfs_lock+0x1f3/0x204
Jan 7 22:57:52 narcissus kernel: [<ffffffff80293f22>] locks_alloc_lock+0x15/0x17
Jan 7 22:57:52 narcissus kernel: [<ffffffff802943b7>] vfs_lock_file+0x1e/0x2d
Jan 7 22:57:52 narcissus kernel: [<ffffffff80294ec5>] fcntl_setlk+0x123/0x246
Jan 7 22:57:52 narcissus kernel: [<ffffffff8028699c>] fget+0xc0/0x104
Jan 7 22:57:52 narcissus kernel: [<ffffffff802911c5>] sys_fcntl+0x2f9/0x37b
Jan 7 22:57:52 narcissus kernel: [<ffffffff8020ba2a>] tracesys+0xdc/0xe1
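For readers following the trace: its bottom frames (sys_fcntl, fcntl_setlk, vfs_lock_file, nfs_lock) are the path taken by an ordinary POSIX byte-range lock request from user space. The following is only a minimal sketch of such a request, not a confirmed reproducer for this bug. The helper name try_write_lock is hypothetical, and the choice of F_SETLK rather than F_SETLKW is an assumption, since the trace does not show which command firefox used.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to take an exclusive POSIX lock on the whole
 * file at `path`, creating the file if needed.
 * Returns 0 if the lock was granted, -1 on any failure.
 */
static int try_write_lock(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    struct flock fl = {
        .l_type = F_WRLCK,    /* exclusive (write) lock */
        .l_whence = SEEK_SET,
        .l_start = 0,
        .l_len = 0,           /* 0 means "through end of file" */
    };

    /* F_SETLK is non-blocking at the VFS level (assumed command). */
    int rc = fcntl(fd, F_SETLK, &fl);

    close(fd);                /* closing the descriptor drops the lock */
    return rc == 0 ? 0 : -1;
}
```

Calling try_write_lock() on a file below an nfs4 mount point would be expected to enter the kernel through the same sys_fcntl path as in the trace; on a local file system it simply takes and releases the lock.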
Created attachment 14371
NFSv4: Give the lock stateid its own sequence queue
One more patch. This one ought to fix the client regression. The current code shares a sequence queue for open and lock requests, so that OPEN and LOCK are always serialised. That works fine, except when we try to grab both an open_seqid and a lock_seqid: only one of those two can be at the head of the queue...
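The head-of-queue problem described above can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the kernel's code: struct seq_queue, seq_at_head and the other names are hypothetical, and the only rule modelled is "a request may proceed only when its sequence id is at the head of its queue".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; all names here are hypothetical. */
#define SEQ_QUEUE_MAX 8

struct seq_queue {
    int ids[SEQ_QUEUE_MAX];   /* FIFO of waiting sequence ids */
    size_t len;
};

static void seq_queue_init(struct seq_queue *q)
{
    q->len = 0;
}

static void seq_queue_add(struct seq_queue *q, int id)
{
    assert(q->len < SEQ_QUEUE_MAX);
    q->ids[q->len++] = id;
}

/* A request may go ahead only when its id is at the head of its queue. */
static bool seq_at_head(const struct seq_queue *q, int id)
{
    return q->len > 0 && q->ids[0] == id;
}

/* A LOCK that needs both an open_seqid and a lock_seqid at once. */
static bool lock_can_proceed(const struct seq_queue *open_q, int open_id,
                             const struct seq_queue *lock_q, int lock_id)
{
    return seq_at_head(open_q, open_id) && seq_at_head(lock_q, lock_id);
}

enum { OPEN_SEQID = 1, LOCK_SEQID = 2 };

/* One queue shared by both ids: only one of them can be at the head. */
static bool shared_queue_case(void)
{
    struct seq_queue shared;

    seq_queue_init(&shared);
    seq_queue_add(&shared, OPEN_SEQID);
    seq_queue_add(&shared, LOCK_SEQID);
    return lock_can_proceed(&shared, OPEN_SEQID, &shared, LOCK_SEQID);
}

/* One queue per id: each id is at the head of its own queue. */
static bool separate_queues_case(void)
{
    struct seq_queue open_q, lock_q;

    seq_queue_init(&open_q);
    seq_queue_init(&lock_q);
    seq_queue_add(&open_q, OPEN_SEQID);
    seq_queue_add(&lock_q, LOCK_SEQID);
    return lock_can_proceed(&open_q, OPEN_SEQID, &lock_q, LOCK_SEQID);
}
```

In this model shared_queue_case() returns false, because the lock id sits behind the open id that the same request is still holding, so the request waits on itself. separate_queues_case() returns true, which is the effect the patch aims for by giving the lock stateid its own sequence queue.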
The first patch (#14366) didn't resolve the problem here. Trying #14371 now.
It works with both fixes applied. Thanks a lot for the fast resolution! Do you want me to test the last patch against vanilla -rc7?
There shouldn't be any changes in -rc7 compared to what I understand you've been testing, but if you have the time, then that would certainly be useful. Thanks!
OK, I will prepare the following setup:
Server: 2.6.24-rc7-git1 (x86_64), no patches
Client: 2.6.24-rc7-git1 (x86_64) with patch #14371
I'll try to add another i386 client at the same patch level.
It seems to work fine here in the configuration mentioned above. I think this can be closed. Will both patches reach mainline before the .24 release?
I've pushed the second patch to Linus. The first should really be up to Bruce Fields, since it is a server bug. AFAICS, he has queued it for 2.6.25...
Patch in Linus' tree: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d0dc3701cb46f73cf8ca393f62e325065b0bbd03
"The first should really be up to Bruce Fields, since it is a server bug. AFAICS, he has queued it for 2.6.25..." Yes. I'm assuming it isn't urgent enough to be pushed to 2.6.24 at this point.