Created attachment 25607 [details]
traces before the crash with 220.127.116.11
Hi, we've now seen these traces, followed by crashes, on numerous hosts with kernels 2.6.32 and upwards. I assume it is NFS-related, as many NFS calls appear in the trace. The servers are moderately loaded file servers with Adaptec 52445 controllers and Quad Xeon E5240 with 8GB RAM. River has about 120 exports from 3 filesystems, simon has 4500 exports in 7 filesystems. I've also reported this to the Debian bug tracker as http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=574348, as this is Debian Lenny. Any hints on how to debug this further would be greatly appreciated - the students especially are not very happy when they lose their sessions for a while ;). We are now back to 2.6.30, as that seems to work, but it has other problems (no working quota, for instance).
First attachment was 18.104.22.168, second 2.6.33 on a different host.
Created attachment 25608 [details]
traces before the crash on 2.6.33
What might be noteworthy is that we use nfs4/krb5 and nfs3 on simon, whereas river sees mainly nfs3 usage or nfs4 with sec=sys.
Created attachment 25612 [details]
NFS: Prevent another deadlock in nfs_release_page()
Does this bug fix the hang?
...I mean "does this _patch_ fix the hang?" :-)
Now that was quick, thanks a million! We will reboot the system tonight and would expect a crash after a day or two of normal workload (weekdays), so we should know more by the end of next week.
This patch indeed does seem to fix our nfs-related crashes. So far there is not a single new line in dmesg other than the usual nscd crash (already reported to debian as http://bugs.debian.org/574990).
The server is only lightly loaded at the moment, but it seems to work flawlessly so far. Could this be a candidate for 2.6.32 stable inclusion? If Ubuntu and Debian are going to use 2.6.32 for their next releases, we'd do so as well.
OK. I will push it to Linus ASAP... Once that is done, the stable kernels should pick it up automatically.
*** Bug 15552 has been marked as a duplicate of this bug. ***
We haven't had another crash yet, but our disk-based backup server running 2.6.33 with this patch has shown more 'task blocked for more than'... messages, see attachment. This happened while doing about 6 simultaneous rdiff-backup sessions from different servers over two bonded 1Gbit interfaces. dmesg, vmstat and ps auxf are attached ... though this could just as well be the backup machine saying 'please wait and don't give me any more traffic I can't handle', because the array (24-disk SATA RAID60 on an Adaptec Series 5 controller) is rebuilding with medium priority, yielding a load of 65.
09:09:28 up 14:05, 4 users, load average: 64.43, 59.53, 55.10
Please tell me this is just an overloaded server and nothing to worry about ;)
Created attachment 25676 [details]
dmesg of backup pool
Created attachment 25677 [details]
ps auxf of backup pool during backup
Created attachment 25678 [details]
"vmstat 60" during backup
Not sure what that is due to, but it is definitely a server bug, and is unrelated to this one.
I'd suggest filing it as a separate bug, so that Bruce can look over it.
Okay, done with https://bugzilla.kernel.org/show_bug.cgi?id=15622
Fix was merged into mainline as commit d812e575822a2b7ab1a7cadae2571505ec6ec2bd with a Cc: firstname.lastname@example.org.
I'm therefore marking this bug as CLOSED. Please reopen if the problem reoccurs.
Created attachment 26519 [details]
22.214.171.124 - kswapd trace after copying a large file over NFS
I'm still getting this with 126.96.36.199. It's 100% reproducible and happens when copying large files over NFS.
I'm also getting this with 188.8.131.52.
Created attachment 26537 [details]
another kswapd stacktrace while copying some GBs
I'm sorry, my kernel is 2.6.32-11. I'm experiencing this problem while copying some GBs of files over NFS.
The stacktrace is attached.
Vladimir, 2.6.32-11 looks like a distro kernel. Please try to reproduce it with a vanilla mainline kernel. I'm quite sure the developers will not debug issues with distro kernels.
I can reproduce it every time with a vanilla kernel.
Please set up a new bugzilla entry for these two traces. I can't see any evidence that you are reporting the same issue here.
Please also run an
'echo t >/proc/sysrq-trigger; dmesg -s 900000 > /tmp/trace.txt'
and post the resulting trace from /tmp/trace.txt
OK, the trace looks very similar to #15552, which is marked as a duplicate of this bug.
But if you're positive, I will make a new entry and post the trace.
It looks similar to the WARNING, which is basically contentless. All it says is that "I'm waiting for something to happen somewhere else".
The question is what is happening somewhere else. In the case of 15552, it turned out to be the same as above (a deadlock due to a filesystem memory allocation recursing back into the same filesystem), and hence it was fixed by the same patch.
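For anyone reading this later, the recursion described above can be illustrated with a small user-space toy model. This is only a sketch under stated assumptions: `ToyFS`, `release_page`, `allocate`, `write_back` and the `may_enter_fs` flag are hypothetical names invented for illustration. They are not the kernel's API and not the merged patch. The short lock timeout stands in for what would be an indefinite hang in the real system.

```python
import threading


class ToyFS:
    """Toy stand-in for a filesystem. Hypothetical; not the real NFS client."""

    def __init__(self):
        # Non-reentrant lock standing in for "a write/commit is in flight".
        self.commit_lock = threading.Lock()

    def release_page(self, may_enter_fs, timeout=0.1):
        """Called from 'reclaim' to try to free a cached page."""
        if not may_enter_fs:
            # Refuse to wait on filesystem I/O; the page simply stays cached.
            return "skipped"
        # Wait for the in-flight write to finish. If our own caller is the
        # one holding commit_lock, this wait can never succeed.
        if self.commit_lock.acquire(timeout=timeout):
            self.commit_lock.release()
            return "released"
        return "deadlock"  # in a real system this would be a hang

    def allocate(self, may_enter_fs):
        """Toy allocator: memory is always short, so it always reclaims."""
        return self.release_page(may_enter_fs)

    def write_back(self, may_enter_fs):
        """Filesystem write path that needs memory while holding the lock."""
        with self.commit_lock:
            return self.allocate(may_enter_fs)


if __name__ == "__main__":
    fs = ToyFS()
    print(fs.allocate(True))     # allocation from outside the filesystem
    print(fs.write_back(True))   # allocation recurses into the same filesystem
    print(fs.write_back(False))  # reclaim is told not to re-enter the filesystem
```

The three calls print `released`, `deadlock` and `skipped` in that order. The general idea is that a reclaim callback must not wait on filesystem work when the allocation that triggered it may itself come from that filesystem. How the actual fix decides this is best read from the commit referenced earlier in this bug.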