Bug 15578
Summary: | Crashes after a few stacktraces in flush and kswapd | ||
---|---|---|---|
Product: | File System | Reporter: | lkolbe |
Component: | NFS | Assignee: | Trond Myklebust (trondmy) |
Status: | CLOSED CODE_FIX | ||
Severity: | high | CC: | andyrtr, rhuddusa, sfrey, tgr.kb, v.kuklin |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.33 2.6.32 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments:
traces befor the crash with 2.6.32.10
traces befor the crash on 2.6.33
NFS: Prevent another deadlock in nfs_release_page()
dmesg of backup pool
ps auxf of backup pool during backup
"vmstat 60" during backup
2.6.32.12 - kswapd trace after copying a large file over NFS
another kswapd stacktrace while copying some GBs
Description
lkolbe
2010-03-19 09:46:53 UTC
Created attachment 25608 [details]
traces befor the crash on 2.6.33
What might be noteworthy is that we use nfs4/krb5 and nfs3 on simon, whereas river sees mainly nfs3 usage or nfs4 with sec=sys.
Created attachment 25612 [details]
NFS: Prevent another deadlock in nfs_release_page()
Does this bug fix the hang?
...I mean "does this _patch_ fix the hang?" :-)
Now that was quick, thanks a million! We will reboot the system tonight and would expect a crash within a day or two of normal (weekday) workload, so we should know more by the end of next week.
Kind regards, Lukas
This patch does indeed seem to fix our NFS-related crashes. So far there is not a single new line in dmesg other than the usual nscd crash (already reported to Debian as http://bugs.debian.org/574990). The server is only lightly loaded at the moment, but it seems to work flawlessly so far.
Could this be a candidate for 2.6.32 stable inclusion? If Ubuntu and Debian are going to use 2.6.32 for their next releases, we'd do so as well.
Kind regards, Lukas
OK. I will push it to Linus ASAP... Once that is done, the stable kernels should pick it up automatically.
*** Bug 15552 has been marked as a duplicate of this bug. ***
We haven't had another crash yet, but our disk-based backup running 2.6.33 with this patch has shown more 'task blocked for more than...' messages, see attachment. This happened while running about 6 simultaneous rdiff-backup sessions from different servers over two bonded 1Gbit interfaces. dmesg, vmstat and ps auxf attached.
... though this could just as well be the backup machine saying 'please wait and don't give me any more traffic I can't handle', because the array (24-disk SATA RAID60 on an Adaptec Series 5 controller) is rebuilding with medium priority, yielding a load of 65. Uptime:
09:09:28 up 14:05, 4 users, load average: 64.43, 59.53, 55.10
Please tell me this is just an overloaded server and nothing to worry about ;)
Created attachment 25676 [details]
dmesg of backup pool
Created attachment 25677 [details]
ps auxf of backup pool during backup
Created attachment 25678 [details]
"vmstat 60" during backup
Not sure what that is due to, but it is definitely a server bug, and is unrelated to this one. I'd suggest filing it as a separate bug, so that Bruce can look over it.
Okay, done with https://bugzilla.kernel.org/show_bug.cgi?id=15622
Fix was merged into mainline as commit d812e575822a2b7ab1a7cadae2571505ec6ec2bd with a Cc: stable@kernel.org. I'm therefore marking this bug as CLOSED. Please reopen if the problem reoccurs.
Created attachment 26519 [details]
2.6.32.12 - kswapd trace after copying a large file over NFS
I'm still getting this with 2.6.32.12. It's 100% reproducible and happens when copying large files over NFS.
I'm also getting this with 2.6.32.12.
Created attachment 26537 [details]
another kswapd stacktrace while copying some GBs
I'm sorry, my kernel is 2.6.32-11. I'm experiencing this problem while copying some GBs of files over NFS.
The stacktrace is attached.
Vladimir, 2.6.32-11 looks like a distro kernel. Please try to reproduce it with a vanilla mainline kernel. I'm quite sure the developers will not debug issues with distro kernels.
I can reproduce it every time with a vanilla kernel.
Please set up a new bugzilla entry for these two traces. I can't see any evidence that you are reporting the same issue here. Please also run an 'echo t >/proc/sysrq-trigger; dmesg -s 900000 > /tmp/trace.txt' and post the resulting trace from /tmp/trace.txt
OK, the trace looks very similar to #15552, which is marked as a duplicate of this bug. But if you're positive, I will make a new entry and post the trace.
It looks similar to the WARNING, which is basically contentless. All it says is that "I'm waiting for something to happen somewhere else". The question is what is happening somewhere else. In the case of 15552, it turned out to be the same as above (a deadlock due to a filesystem memory allocation recursing back into the same filesystem), and hence it was fixed by the same patch.
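
For readers unfamiliar with this class of deadlock, here is a toy user-space model of "a filesystem memory allocation recursing back into the same filesystem". It is only a sketch and is not kernel code: every name in it (ToyFS, alloc, release_page, the GFP_* constants) is hypothetical and only loosely mirrors the kernel concepts. The guard shown is the general idea of honoring a "do not re-enter the filesystem" hint during reclaim; whether it matches the attached patch line for line is not something this thread shows.

```python
# Toy user-space model of reclaim recursing into a filesystem. NOT kernel code.
# Every name here (ToyFS, alloc, GFP_FS, ...) is hypothetical.

GFP_WAIT = 0x1                  # caller may block, so reclaim is allowed to run
GFP_FS = 0x2                    # reclaim may call back into the filesystem
GFP_KERNEL = GFP_WAIT | GFP_FS
GFP_NOFS = GFP_WAIT             # may block, but must not re-enter the filesystem


class Deadlock(Exception):
    """Raised at the point where a real system would hang."""


class ToyFS:
    def __init__(self, guarded):
        self.guarded = guarded      # True: honor GFP_FS in release_page()
        self.in_writeback = False   # stands in for a non-recursive lock

    def writeback(self, memory_low):
        if self.in_writeback:
            raise Deadlock("writeback re-entered from reclaim")
        self.in_writeback = True
        try:
            # Writeback itself needs memory. It asks with GFP_NOFS to say
            # "do not call back into me while I hold my lock".
            alloc(self, GFP_NOFS, memory_low)
        finally:
            self.in_writeback = False

    def release_page(self, gfp, memory_low):
        """Called by reclaim. Returns True if the page could be freed."""
        if self.guarded and not (gfp & GFP_FS):
            return False            # refuse to do I/O; the page stays cached
        self.writeback(memory_low)  # flush the page so it can be freed
        return True


def alloc(fs, gfp, memory_low):
    """Allocate memory; under pressure, try to reclaim a page from fs."""
    if memory_low and (gfp & GFP_WAIT):
        fs.release_page(gfp, memory_low)
    return "memory"


def hangs(guarded):
    """True if an allocation under memory pressure would deadlock."""
    try:
        alloc(ToyFS(guarded), GFP_KERNEL, memory_low=True)
        return False
    except Deadlock:
        return True
```

In this model hangs(False) reports the deadlock, because the unguarded release_page() ignores the GFP_NOFS hint and re-enters writeback, while hangs(True) completes because the guarded version declines to do I/O and simply leaves the page in the cache.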