Bug 9712
Summary: | BUG in current NFS4 code makes the system unusable | ||
---|---|---|---|
Product: | File System | Reporter: | Puzin, Dimitri (bugs) |
Component: | NFS | Assignee: | Trond Myklebust (trondmy) |
Status: | CLOSED CODE_FIX | ||
Severity: | blocking | CC: | akpm, bfields, bunk, gentuu, rjw |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.24-rc6-git10 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 9243 | ||
Attachments: | 14366: nfsd4: fix bad seqid on lock request incompatible with open mode; 14368: SysRq-t dump; 14371: NFSv4: Give the lock stateid its own sequence queue | ||
Description
Puzin, Dimitri
2008-01-07 23:41:20 UTC
argh.

Created attachment 14366
nfsd4: fix bad seqid on lock request incompatible with open mode

As I said in my email to you about this subject, I'm interested in obtaining the SysRq-t trace when the hang occurs. That said, the attached patch fixes a known bug with sequence ids on the Linux server and might make a difference.

Created attachment 14368
SysRq-t dump

The dump is attached. It is rather large. I think the offending part looks like this:
Jan 7 22:57:52 narcissus kernel: firefox-bin S 0000000000000001 0 1714 1727
Jan 7 22:57:52 narcissus kernel: ffff810114c2dbd8 0000000000000046 ffff810114c2db78 ffffffff80256348
Jan 7 22:57:52 narcissus kernel: ffff810114c2a000 ffffffff805b5c8a ffff810114c2a000 ffff810122dd8000
Jan 7 22:57:52 narcissus kernel: ffff810114c2a218 0000000180256516 ffff810009857730 0000000000000292
Jan 7 22:57:52 narcissus kernel: Call Trace:
Jan 7 22:57:52 narcissus kernel: [<ffffffff80256348>] mark_held_locks+0x4a/0x6a
Jan 7 22:57:52 narcissus kernel: [<ffffffff805b5c8a>] _spin_unlock_irqrestore+0x3f/0x69
Jan 7 22:57:52 narcissus kernel: [<ffffffff805b5c97>] _spin_unlock_irqrestore+0x4c/0x69
Jan 7 22:57:52 narcissus kernel: [<ffffffff8059c81c>] rpc_wait_bit_interruptible+0x22/0x28
Jan 7 22:57:52 narcissus kernel: [<ffffffff805b38de>] __wait_on_bit+0x45/0x77
Jan 7 22:57:52 narcissus kernel: [<ffffffff8059c7fa>] rpc_wait_bit_interruptible+0x0/0x28
Jan 7 22:57:52 narcissus kernel: [<ffffffff8059c7fa>] rpc_wait_bit_interruptible+0x0/0x28
Jan 7 22:57:52 narcissus kernel: [<ffffffff805b397e>] out_of_line_wait_on_bit+0x6e/0x7b
Jan 7 22:57:52 narcissus kernel: [<ffffffff8024b194>] wake_bit_function+0x0/0x2a
Jan 7 22:57:52 narcissus kernel: [<ffffffff8059c7a0>] __rpc_wait_for_completion_task+0x3a/0x40
Jan 7 22:57:52 narcissus kernel: [<ffffffff80306987>] nfs4_wait_for_completion_rpc_task+0x2a/0x47
Jan 7 22:57:52 narcissus kernel: [<ffffffff80306c2b>] _nfs4_do_setlk+0x1a1/0x205
Jan 7 22:57:52 narcissus kernel: [<ffffffff803073f1>] nfs4_proc_lock+0x309/0x40b
Jan 7 22:57:52 narcissus kernel: [<ffffffff802f610d>] do_setlk+0x61/0xbb
Jan 7 22:57:52 narcissus kernel: [<ffffffff802f640d>] nfs_lock+0x1f3/0x204
Jan 7 22:57:52 narcissus kernel: [<ffffffff80293f22>] locks_alloc_lock+0x15/0x17
Jan 7 22:57:52 narcissus kernel: [<ffffffff802943b7>] vfs_lock_file+0x1e/0x2d
Jan 7 22:57:52 narcissus kernel: [<ffffffff80294ec5>] fcntl_setlk+0x123/0x246
Jan 7 22:57:52 narcissus kernel: [<ffffffff8028699c>] fget+0xc0/0x104
Jan 7 22:57:52 narcissus kernel: [<ffffffff802911c5>] sys_fcntl+0x2f9/0x37b
Jan 7 22:57:52 narcissus kernel: [<ffffffff8020ba2a>] tracesys+0xdc/0xe1
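For orientation: the bottom of the trace above (sys_fcntl, fcntl_setlk, vfs_lock_file, nfs_lock) is the path an ordinary POSIX byte-range lock request takes from user space. The following is a minimal sketch of such a request, not the reporter's reproducer. The helper name `take_and_drop_lock` and the use of a temporary file are assumptions made for illustration; to exercise the NFSv4 code the path would have to point at a file on an NFSv4 mount.

```python
import fcntl
import os
import tempfile


def take_and_drop_lock(path):
    """Take an exclusive POSIX lock on the whole file, then release it.

    Hypothetical helper for illustration only. fcntl.lockf() issues the
    fcntl() locking call, which is where the trace above enters the kernel.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        # LOCK_NB: fail with OSError instead of blocking if the lock is held.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)


# Assumption: a local temporary file stands in for a file on the NFSv4 mount.
with tempfile.NamedTemporaryFile() as tmp:
    print(take_and_drop_lock(tmp.name))  # True
```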
Created attachment 14371
NFSv4: Give the lock stateid its own sequence queue

One more patch. This one ought to fix the client regression. The current code shares a single sequence queue between open and lock requests, so that OPEN and LOCK are always serialised. That works fine, except when we try to grab both an open_seqid and a lock_seqid: only one of those two can be at the head of the queue...

The first patch (#14366) didn't resolve the problem here. Trying #14371 now.

It works with both fixes applied. Thanks a lot for the fast resolution! Do you want me to test the last patch against vanilla -rc7?

There shouldn't be any changes in -rc7 compared to what I understand you've been testing, but if you have the time, then that would certainly be useful. Thanks!

OK, I will prepare the following configuration:
Server: 2.6.24-rc7-git1 (x86_64), no patches
Client: 2.6.24-rc7-git1 (x86_64) with patch #14371
I'll try to add another i386 client at the same patch level.

It seems to work fine here in the configuration mentioned above. I think this can be closed. Will both patches reach mainline before the .24 release?

I've pushed the second patch to Linus. The first should really be up to Bruce Fields, since it is a server bug. AFAICS, he has queued it for 2.6.25...

Patch in Linus' tree: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d0dc3701cb46f73cf8ca393f62e325065b0bbd03

"The first should really be up to Bruce Fields, since it is a server bug. AFAICS, he has queued it for 2.6.25..."
Yes. I'm assuming it isn't urgent enough to be pushed to 2.6.24 at this point.
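The shared-queue problem described in the comment accompanying attachment 14371 can be illustrated with a toy model. This is only a sketch of the reasoning in that comment, not the kernel code: the names `SequenceQueue`, `Seqid` and `lock_request_can_proceed` are hypothetical, and the model assumes a request may proceed only while every seqid it holds is at the head of its queue.

```python
from collections import deque


class SequenceQueue:
    """FIFO of pending seqids; only the entry at the head may proceed."""

    def __init__(self):
        self._pending = deque()

    def enqueue(self, seqid):
        self._pending.append(seqid)

    def at_head(self, seqid):
        return bool(self._pending) and self._pending[0] is seqid


class Seqid:
    """A seqid joins the tail of its queue as soon as it is created."""

    def __init__(self, name, queue):
        self.name = name
        self.queue = queue
        queue.enqueue(self)

    def ready(self):
        return self.queue.at_head(self)


def lock_request_can_proceed(open_queue, lock_queue):
    """Model a LOCK that needs an open seqid and a lock seqid at once."""
    open_seqid = Seqid("open_seqid", open_queue)
    lock_seqid = Seqid("lock_seqid", lock_queue)
    return open_seqid.ready() and lock_seqid.ready()


# One queue shared by both seqids: the second one can never reach the head
# while the first is still held, so the request waits forever.
shared = SequenceQueue()
print(lock_request_can_proceed(shared, shared))  # False

# Separate queues, as the patch title suggests: both are at the head.
print(lock_request_can_proceed(SequenceQueue(), SequenceQueue()))  # True
```

In the shared case the request holds the head entry itself, so nothing else can release it; that matches the task sleeping indefinitely under _nfs4_do_setlk in the SysRq-t dump.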