Bug 15578 - Crashes after a few stacktraces in flush and kswapd
Summary: Crashes after a few stacktraces in flush and kswapd
Status: CLOSED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: NFS
Hardware: All Linux
Importance: P1 high
Assignee: Trond Myklebust
URL:
Keywords:
Duplicates: 15552
Depends on:
Blocks:
 
Reported: 2010-03-19 09:46 UTC by lkolbe
Modified: 2010-07-13 02:04 UTC
CC List: 5 users

See Also:
Kernel Version: 2.6.33 2.6.32
Tree: Mainline
Regression: Yes


Attachments
traces before the crash with 2.6.32.10 (44.68 KB, text/plain)
2010-03-19 09:46 UTC, lkolbe
traces before the crash on 2.6.33 (42.84 KB, text/plain)
2010-03-19 09:47 UTC, lkolbe
NFS: Prevent another deadlock in nfs_release_page() (899 bytes, patch)
2010-03-19 12:52 UTC, Trond Myklebust
dmesg of backup pool (90.53 KB, application/octet-stream)
2010-03-24 09:12 UTC, lkolbe
ps auxf of backup pool during backup (56.97 KB, application/octet-stream)
2010-03-24 09:12 UTC, lkolbe
"vmstat 60" during backup (4.81 KB, application/octet-stream)
2010-03-24 09:13 UTC, lkolbe
2.6.32.12 - kswapd trace after copying a large file over NFS (2.55 KB, text/plain)
2010-05-24 12:37 UTC, Leszek Urbanski
another kswapd stacktrace while copying some GBs (28.99 KB, text/plain)
2010-05-25 09:11 UTC, Vladimir Kuklin

Description lkolbe 2010-03-19 09:46:53 UTC
Created attachment 25607
traces before the crash with 2.6.32.10

Hi, we've seen these traces, followed by crashes, on numerous hosts now with kernels from 2.6.32 upwards. I assume it is NFS-related, as many NFS calls appear in the traces. The servers are moderately loaded file servers with Adaptec 52445 controllers and Quad Xeon E5240 CPUs with 8GB RAM. River has about 120 exports from 3 filesystems, simon has 4500 exports in 7 filesystems. I've also reported this to the Debian bug tracker as http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=574348, as this is Debian Lenny. Any hints on how to debug this further would be greatly appreciated, especially since the students are not very happy when they lose their sessions for a while ;). We are now back on 2.6.30, as that seems to work, but it has other problems (no working quota, for instance).

First attachment was 2.6.32.10, second 2.6.33 on a different host.
Comment 1 lkolbe 2010-03-19 09:47:51 UTC
Created attachment 25608
traces before the crash on 2.6.33
Comment 2 lkolbe 2010-03-19 09:49:22 UTC
What might be noteworthy is that we use nfs4/krb5 and nfs3 on simon, whereas river sees mainly nfs3 usage or nfs4 with sec=sys.
Comment 3 Trond Myklebust 2010-03-19 12:52:11 UTC
Created attachment 25612
NFS: Prevent another deadlock in nfs_release_page()

Does this bug fix the hang?
Comment 4 Trond Myklebust 2010-03-19 12:52:44 UTC
...I mean "does this _patch_ fix the hang?" :-)
Comment 5 lkolbe 2010-03-19 16:57:39 UTC
Now that was quick, thanks a million! We will reboot the system tonight and would expect a crash after a day or two of normal workload (weekdays), so we should know more by the end of next week.

Kind regards,
Lukas
Comment 6 lkolbe 2010-03-23 14:40:56 UTC
This patch indeed does seem to fix our nfs-related crashes. So far there is not a single new line in dmesg other than the usual nscd crash (already reported to debian as http://bugs.debian.org/574990). 

The server is only lightly loaded at the moment, but it seems to work flawlessly so far. Could this be a candidate for inclusion in 2.6.32 stable? If Ubuntu and Debian are going to use 2.6.32 for their next releases, we'd do so as well.

Kind regards,
Lukas
Comment 7 Trond Myklebust 2010-03-23 16:57:15 UTC
OK. I will push it to Linus ASAP... Once that is done, the stable kernels should pick it up automatically.
Comment 8 Trond Myklebust 2010-03-23 21:34:13 UTC
*** Bug 15552 has been marked as a duplicate of this bug. ***
Comment 9 lkolbe 2010-03-24 09:10:31 UTC
We haven't had another crash yet, but our disk-based backup running 2.6.33 with this patch has shown more 'task blocked for more than'... messages, see attachment. This happened while doing about 6 simultaneous rdiff-backup sessions from different servers over two bonded 1Gbit interfaces. dmesg, vmstat and ps auxf attached ... though this could just as well be the backup machine saying 'please wait and don't give me any more traffic I can't handle', because the array (24 disk SATA RAID60 on an Adaptec Series 5 controller) is rebuilding with medium priority, yielding a load of 65.
Uptime:
 09:09:28 up 14:05,  4 users,  load average: 64.43, 59.53, 55.10
Please tell me this is just an overloaded server and nothing to worry about ;)
Comment 10 lkolbe 2010-03-24 09:12:14 UTC
Created attachment 25676
dmesg of backup pool
Comment 11 lkolbe 2010-03-24 09:12:50 UTC
Created attachment 25677
ps auxf of backup pool during backup
Comment 12 lkolbe 2010-03-24 09:13:43 UTC
Created attachment 25678
"vmstat 60" during backup
Comment 13 Trond Myklebust 2010-03-24 13:57:50 UTC
Not sure what that is due to, but it is definitely a server bug, and is unrelated to this one.

I'd suggest filing it as a separate bug, so that Bruce can look over it.
Comment 14 lkolbe 2010-03-24 14:39:39 UTC
Okay, done with https://bugzilla.kernel.org/show_bug.cgi?id=15622
Comment 15 Trond Myklebust 2010-03-25 15:50:10 UTC
Fix was merged into mainline as commit d812e575822a2b7ab1a7cadae2571505ec6ec2bd with a Cc: stable@kernel.org.

I'm therefore marking this bug as CLOSED. Please reopen if the problem reoccurs.
Comment 16 Leszek Urbanski 2010-05-24 12:37:56 UTC
Created attachment 26519
2.6.32.12 - kswapd trace after copying a large file over NFS

I'm still getting this with 2.6.32.12. It's 100% reproducible and happens when copying large files over NFS.
Comment 17 Vladimir Kuklin 2010-05-25 09:03:25 UTC
I'm also getting this with 2.6.32.12.
Comment 18 Vladimir Kuklin 2010-05-25 09:11:56 UTC
Created attachment 26537
another kswapd stacktrace while copying some GBs

I'm sorry, my kernel is 2.6.32-11. I'm experiencing this problem while copying some GBs of files over NFS.
The stacktrace is attached.
Comment 19 Leszek Urbanski 2010-05-25 21:15:53 UTC
Vladimir, 2.6.32-11 looks like a distro kernel. Please try to reproduce it with a vanilla mainline kernel. I'm quite sure the developers will not debug issues with distro kernels.

I can reproduce it every time with a vanilla kernel.
Comment 20 Trond Myklebust 2010-05-25 23:01:49 UTC
Please set up a new bugzilla entry for these two traces. I can't see any evidence that you are reporting the same issue here.

Please also run

'echo t >/proc/sysrq-trigger; dmesg -s 900000 > /tmp/trace.txt'

and post the resulting trace from /tmp/trace.txt
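
For anyone who wants to script the capture requested above, here is a rough Python sketch of the same two steps. The function name `capture_task_dump` and its parameters are made up for illustration and are not part of any tool. It assumes it runs as root on a kernel with sysrq support, and that `dmesg` accepts `-s` as in the command above.

```python
import subprocess

def capture_task_dump(trigger="/proc/sysrq-trigger",
                      out_path="/tmp/trace.txt",
                      dmesg_cmd=("dmesg", "-s", "900000")):
    """Ask the kernel to log all task states, then save the kernel log.

    Hypothetical helper: the defaults mirror the shell one-liner above.
    """
    # Same effect as: echo t > /proc/sysrq-trigger
    with open(trigger, "w") as f:
        f.write("t\n")
    # Same effect as: dmesg -s 900000 > /tmp/trace.txt
    result = subprocess.run(list(dmesg_cmd), capture_output=True,
                            text=True, check=True)
    with open(out_path, "w") as f:
        f.write(result.stdout)
    return out_path

# Usage (as root): capture_task_dump()
```

The trigger path and the dmesg command are parameters only so the helper can be tried against ordinary files without touching a live system.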
Comment 21 Leszek Urbanski 2010-05-25 23:15:03 UTC
OK, the trace looks very similar to #15552, which is marked as a duplicate of this bug.
But if you're positive, I will make a new entry and post the trace.
Comment 22 Trond Myklebust 2010-05-25 23:28:17 UTC
It looks similar to the WARNING, which is basically contentless. All it says is that "I'm waiting for something to happen somewhere else".

The question is what is happening somewhere else. In the case of 15552, it turned out to be the same as above (a deadlock due to a filesystem memory allocation recursing back into the same filesystem), and hence it was fixed by the same patch.
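
To illustrate the kind of deadlock described above, here is a toy user-space model in Python. It is not the NFS patch and does not reflect the real kernel code: `ToyFilesystem`, `allocate` and the `allow_fs` flag are invented names. `allow_fs` only loosely plays the role of an allocation flag such as the kernel's `__GFP_FS`, which tells the allocator whether reclaim may call back into filesystem code.

```python
import threading

class ToyFilesystem:
    """Hypothetical stand-in for a filesystem with one non-reentrant lock."""

    def __init__(self):
        self.lock = threading.Lock()  # non-reentrant on purpose
        self.dirty_pages = 3

    def release_page(self, allow_fs):
        """Called by reclaim to free a page. Returns a status string."""
        if not allow_fs:
            # Caller asked reclaim not to re-enter the filesystem: skip the page.
            return "skipped"
        # A blocking acquire would hang forever if the caller already holds
        # the lock, so the model probes without blocking instead.
        if not self.lock.acquire(blocking=False):
            return "would deadlock"
        try:
            self.dirty_pages -= 1
            return "released"
        finally:
            self.lock.release()

    def write(self, allow_fs):
        """Filesystem operation that allocates memory while holding its lock."""
        with self.lock:
            return allocate(self, allow_fs)

def allocate(fs, allow_fs):
    """Toy allocator: memory is always tight, so every allocation reclaims."""
    return fs.release_page(allow_fs)

fs = ToyFilesystem()
print(fs.write(allow_fs=True))         # would deadlock
print(fs.write(allow_fs=False))        # skipped
print(fs.release_page(allow_fs=True))  # released
```

In the first call the allocation recurses into the filesystem that is already holding its own lock, which in a real kernel shows up as a task "waiting for something to happen somewhere else". Refusing to re-enter the filesystem from such an allocation, as in the second call, avoids the recursion at the cost of not freeing that page right away.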
