Kernel Bug Tracker – Bug 38302
NFS crash in un-modified 3.0.0-rc3+, list corruption.
Last modified: 2011-08-14 19:10:36 UTC
Subject : NFS crash in un-modified 3.0.0-rc3+, list corruption.
Submitter : Ben Greear <firstname.lastname@example.org>
Date : 2011-06-20 22:42
Message-ID : 4DFFCCDC.email@example.com
References : http://marc.info/?l=linux-kernel&m=130860979610651&w=2
This entry is being used for tracking a regression from 2.6.39. Please don't
close it until the problem is fixed in the mainline.
Reassigning to Bruce, since this appears to be a server bug.
We hit a similar bug on a hacked 126.96.36.199 kernel, so I don't think
this is a recent regression.
Also, the file-IO traffic in this test case was 2k read/writes using O_DIRECT.
I'm not sure if that matters or not.
How similar? Do you have a backtrace?
Do you know a recent kernel on which this *didn't* occur?
I don't have useful info on the previous kernels for this particular issue, and haven't had time to test on older clean kernels.
I'm trying to debug some client side issues (in a patched kernel), and after I get that resolved, we can do some tests on older kernels for this particular issue.
I found my notes from our internal testing, and we verified that the crash happens in 188.8.131.52, so this is not a regression in 3.0. We will test some older kernels when we have time. If there are any particular ones of interest, please let me know.
Hm, I wonder if I screwed this up in the supposed "cleanup" of commit f8c0d226fe
"svcrpc: simplify svc_close_all", which went into 2.6.37.
Created attachment 63902 [details]
svcrpc: fix list-corrupting race on nfsd shutdown
Looking again, I think that just shuffled the problem around, and it was originally introduced way back in 2.6.29. We didn't notice it because it's a relatively rare race, without consequences unless the memory allocated by this xprt is immediately allocated to something else and corrupted by the list_add.
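For readers following along, here is a rough user-space sketch of the failure mode described above: an entry is freed while still linked on a list, its memory is reused, and a later list_add writes into it. This is not the sunrpc code and not the attached patch. The list helpers only mimic the shape of the kernel's list_head, the name reused_memory_corrupted() is made up for illustration, and the "free and reuse" step is simulated with memset on a live stack object so the program stays well defined.

```c
#include <assert.h>
#include <string.h>

/* Minimal circular doubly linked list, shaped like the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void init_list(struct list_head *h) { h->next = h; h->prev = h; }

/* Insert entry right after head. Note the write into head->next's memory. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    struct list_head *next = head->next;
    next->prev = entry;          /* lands in whatever memory 'next' points at */
    entry->next = next;
    entry->prev = head;
    head->next = entry;
}

/* Unlink entry and leave it pointing at itself. */
static void list_del_init(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    init_list(e);
}

/*
 * Hypothetical demo: returns 1 if a late list_add scribbled over memory
 * that had already been handed to a new owner, 0 otherwise.
 */
static int reused_memory_corrupted(int unlink_before_free)
{
    struct slot { struct list_head ready; } mem;  /* stands in for the xprt */
    struct list_head ready_list, other;
    unsigned char pattern[sizeof mem];

    init_list(&ready_list);
    list_add(&mem.ready, &ready_list);            /* xprt is queued */

    if (unlink_before_free)
        list_del_init(&mem.ready);                /* the fixed ordering */

    /* Simulated free + immediate reuse: the new owner fills the memory. */
    memset(&mem, 0xAB, sizeof mem);
    memset(pattern, 0xAB, sizeof pattern);

    /* A straggling enqueue on the same list after the "free". */
    list_add(&other, &ready_list);
    assert(ready_list.next == &other && other.prev == &ready_list);

    /* Did the new owner's data survive? */
    return memcmp(&mem, pattern, sizeof mem) != 0;
}
```

Calling reused_memory_corrupted(0) models freeing the entry while it is still on the list, and reused_memory_corrupted(1) models unlinking it first. Only the first case overwrites the reused memory, which matches why the race goes unnoticed unless the freed memory is reallocated right away.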
Anyway, does the attached help?
Also, out of curiosity: why is your server getting restarted so often?
Yes, that fixes it. That is the same functional change as what I posted in the original email, though your comments are much better.
I've tested that fix in -rc4, so feel free to add my tested-by.
Sorry, forgot to answer your other question:
We were restarting so often to test client side handling of server restarts.
But, with 200+ active mounts doing steady O_DIRECT traffic, it only takes a few restarts to hit the crash on our server.
Oh crap, I suck--I saw the email but must not have paged down to the bottom. I could've saved myself some trouble.
Thanks for the report and testing.
Fixed by commit ebc63e531cc6a457595dd110b07ac530eae788c3.