Subject    : NFS crash in un-modified 3.0.0-rc3+, list corruption.
Submitter  : Ben Greear <greearb@candelatech.com>
Date       : 2011-06-20 22:42
Message-ID : 4DFFCCDC.2070106@candelatech.com
References : http://marc.info/?l=linux-kernel&m=130860979610651&w=2

This entry is being used for tracking a regression from 2.6.39. Please don't close it until the problem is fixed in the mainline.
Reassigning to Bruce, since this appears to be a server bug.
We hit a similar bug on a hacked 2.6.38.8 kernel, so I don't think this is a recent regression. Also, the file-IO traffic in this test case was 2k reads/writes using O_DIRECT. I'm not sure whether that matters.
How similar? Do you have a backtrace? Do you know a recent kernel on which this *didn't* occur?
I don't have useful info on the previous kernels for this particular issue, and haven't had time to test on older clean kernels. I'm trying to debug some client side issues (in a patched kernel), and after I get that resolved, we can do some tests on older kernels for this particular issue.
I found my notes from our internal testing, and we verified that the crash happens in 2.6.39.1, so this is not a regression in 3.0. We will test some older kernels when we have time. If there are any particular ones of interest, please let me know.
Hm, I wonder if I screwed this up in the supposed "cleanup" of commit f8c0d226fe "svcrpc: simplify svc_close_all", which went into 2.6.37.
Created attachment 63902
svcrpc: fix list-corrupting race on nfsd shutdown

Looking again, I think that just shuffled the problem around, and it was originally introduced way back in 2.6.29. We didn't notice it because it's a relatively rare race, without consequences unless the memory that had been allocated for this xprt is immediately reallocated to something else and then corrupted by the list_add. I think. Anyway, does the attached help? Also, out of curiosity: why is your server getting restarted so often?
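For readers following the thread, the class of bug described above can be sketched in userspace C. This is only an illustration under stated assumptions, not the attached patch or the kernel code: `list_node`, `fake_xprt`, `fake_xprt_destroy` and `demo_shutdown` are hypothetical names, and the list helpers are small stand-ins modeled on the kernel's `list_head` API. The point it tries to show is that an object should be unlinked from any queue it still sits on before its memory is freed, so that a later list insertion never writes through pointers into freed memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's struct list_head (illustrative only). */
struct list_node {
    struct list_node *next, *prev;
};

static void list_init(struct list_node *n) { n->next = n; n->prev = n; }

/* Insert n just before head, i.e. at the tail of the circular list. */
static void list_add_tail_node(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink n and leave it pointing at itself, in the spirit of list_del_init(). */
static void list_del_init_node(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    list_init(n);
}

/* Hypothetical transport object; "ready" plays the role of a ready-queue link. */
struct fake_xprt {
    int id;
    struct list_node ready;
};

static size_t list_length(const struct list_node *head)
{
    size_t n = 0;
    for (const struct list_node *p = head->next; p != head; p = p->next)
        n++;
    return n;
}

/* Returns 1 if every neighbour pair on the list agrees about its links. */
static int list_is_consistent(const struct list_node *head)
{
    const struct list_node *p = head;
    do {
        if (p->next->prev != p || p->prev->next != p)
            return 0;
        p = p->next;
    } while (p != head);
    return 1;
}

/* Safe teardown: take the object off the queue first, then free it.
 * Freeing without the unlink would leave the neighbours pointing at
 * freed memory, and the next insertion would write into it. */
static void fake_xprt_destroy(struct fake_xprt *x)
{
    list_del_init_node(&x->ready);
    free(x);
}

/* Queue `count` objects, destroy the one at index `victim`, then enqueue
 * one late entry. Returns the resulting queue length, or -1 on bad
 * arguments or allocation failure. */
static long demo_shutdown(int count, int victim)
{
    struct list_node ready_queue;
    struct fake_xprt *xprts[8] = {0};
    struct fake_xprt late = { .id = -1 };
    long len;

    if (count < 1 || count > 8 || victim < 0 || victim >= count)
        return -1;

    list_init(&ready_queue);
    for (int i = 0; i < count; i++) {
        xprts[i] = malloc(sizeof(*xprts[i]));
        if (!xprts[i]) {
            while (--i >= 0)
                fake_xprt_destroy(xprts[i]);
            return -1;
        }
        xprts[i]->id = i;
        list_add_tail_node(&xprts[i]->ready, &ready_queue);
    }

    fake_xprt_destroy(xprts[victim]);
    xprts[victim] = NULL;

    /* A late enqueue only touches live nodes, because the victim was unlinked. */
    list_add_tail_node(&late.ready, &ready_queue);
    assert(list_is_consistent(&ready_queue));
    len = (long)list_length(&ready_queue);

    list_del_init_node(&late.ready);
    for (int i = 0; i < count; i++)
        if (xprts[i])
            fake_xprt_destroy(xprts[i]);
    return len;
}
```

Calling `demo_shutdown(3, 1)` from a small `main` queues three objects, destroys the middle one, and enqueues one more. Removing the `list_del_init_node()` call from `fake_xprt_destroy` turns the late enqueue into a write through a dangling pointer, which is undefined behaviour and may or may not crash, much like the rare race described in the comment above.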
Yes, that fixes it. That is the same functional change as what I posted in the original email, though your comments are much better: http://www.spinics.net/lists/linux-nfs/msg22169.html I've tested that fix in -rc4, so feel free to add my tested-by.
Sorry, forgot to answer your other question: We were restarting so often to test client side handling of server restarts. But, with 200+ active mounts doing steady O_DIRECT traffic, it only takes a few restarts to hit the crash on our server.
Oh crap, I suck--I saw the email but must not have paged down to the bottom. I could've saved myself some trouble. Thanks for the report and testing.
Patch: https://bugzilla.kernel.org/attachment.cgi?id=63902
Fixed by commit ebc63e531cc6a457595dd110b07ac530eae788c3.