Bug 38302
| Summary: | NFS crash in un-modified 3.0.0-rc3+, list corruption. | | |
|---|---|---|---|
| Product: | File System | Reporter: | Maciej Rutecki (maciej.rutecki) |
| Component: | NFS | Assignee: | bfields |
| Status: | CLOSED CODE_FIX | | |
| Severity: | normal | CC: | florian, greearb, maciej.rutecki, rjw, trondmy |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 3.0-rc3+ | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
| Bug Depends on: | | | |
| Bug Blocks: | 36912 | | |
| Attachments: | svcrpc: fix list-corrupting race on nfsd shutdown | | |
Description
Maciej Rutecki
2011-06-26 20:35:25 UTC
Reassigning to Bruce, since this appears to be a server bug.
We hit a similar bug on a hacked 2.6.38.8 kernel, so I don't think this is a recent regression. Also, the file-IO traffic in this test case was 2k reads/writes using O_DIRECT. I'm not sure if that matters or not.
How similar? Do you have a backtrace? Do you know a recent kernel on which this *didn't* occur?
I don't have useful info on the previous kernels for this particular issue, and haven't had time to test on older clean kernels. I'm trying to debug some client-side issues (in a patched kernel), and after I get that resolved, we can do some tests on older kernels for this particular issue.
I found my notes from our internal testing, and we verified that the crash happens in 2.6.39.1, so this is not a regression in 3.0. We will test some older kernels when we have time. If there are any particular ones of interest, please let me know.
Hm, I wonder if I screwed this up in the supposed "cleanup" of commit f8c0d226fe "svcrpc: simplify svc_close_all", which went into 2.6.37.
Created attachment 63902
svcrpc: fix list-corrupting race on nfsd shutdown
Looking again, I think that just shuffled the problem around, and it was originally introduced way back in 2.6.29. We didn't notice it because it's a relatively rare race, without consequences unless the memory allocated by this xprt is immediately allocated to something else and corrupted by the list_add.
I think.
Anyway, does the attached help?
Also, out of curiosity: why is your server getting restarted so often?
Yes, that fixes it. That is the same functional change as what I posted in the original email, though your comments are much better: http://www.spinics.net/lists/linux-nfs/msg22169.html
I've tested that fix in -rc4, so feel free to add my tested-by.
Sorry, forgot to answer your other question: We were restarting so often to test client side handling of server restarts. But, with 200+ active mounts doing steady O_DIRECT traffic, it only takes a few restarts to hit the crash on our server.
Oh crap, I suck--I saw the email but must not have paged down to the bottom. I could've saved myself some trouble. Thanks for the report and testing.
Fixed by commit ebc63e531cc6a457595dd110b07ac530eae788c3.