|Summary:||NFS crash in un-modified 3.0.0-rc3+, list corruption.|
|Product:||File System||Reporter:||Maciej Rutecki (maciej.rutecki)|
|Severity:||normal||CC:||florian, greearb, maciej.rutecki, rjw, trondmy|
|Bug Depends on:|
|Attachments:||svcrpc: fix list-corrupting race on nfsd shutdown|
Description Maciej Rutecki 2011-06-26 20:35:25 UTC
Subject : NFS crash in un-modified 3.0.0-rc3+, list corruption.
Submitter : Ben Greear <email@example.com>
Date : 2011-06-20 22:42
Message-ID : 4DFFCCDC.firstname.lastname@example.org
References : http://marc.info/?l=linux-kernel&m=130860979610651&w=2

This entry is being used for tracking a regression from 2.6.39. Please don't close it until the problem is fixed in the mainline.
Comment 1 Trond Myklebust 2011-06-26 22:13:07 UTC
Reassigning to Bruce, since this appears to be a server bug.
Comment 2 Ben Greear 2011-06-27 16:58:58 UTC
We hit a similar bug on a hacked 220.127.116.11 kernel, so I don't think this is a recent regression. Also, the file-IO traffic in this test case was 2k read/writes using O_DIRECT. I'm not sure if that matters or not.
Comment 3 bfields 2011-06-28 19:18:30 UTC
How similar? Do you have a backtrace? Do you know a recent kernel on which this *didn't* occur?
Comment 4 Ben Greear 2011-06-28 19:26:35 UTC
I don't have useful info on the previous kernels for this particular issue, and haven't had time to test on older clean kernels. I'm trying to debug some client side issues (in a patched kernel), and after I get that resolved, we can do some tests on older kernels for this particular issue.
Comment 5 Ben Greear 2011-06-29 16:42:41 UTC
I found my notes from our internal testing, and we verified that the crash happens in 18.104.22.168, so this is not a regression in 3.0. We will test some older kernels when we have time. If there are any particular ones of interest, please let me know.
Comment 6 bfields 2011-06-29 20:06:59 UTC
Hm, I wonder if I screwed this up in the supposed "cleanup" of commit f8c0d226fe "svcrpc: simplify svc_close_all", which went into 2.6.37.
Comment 7 bfields 2011-06-29 21:03:04 UTC
Created attachment 63902: svcrpc: fix list-corrupting race on nfsd shutdown

Looking again, I think that just shuffled the problem around, and it was originally introduced way back in 2.6.29. We didn't notice it because it's a relatively rare race, without consequences unless the memory allocated by this xprt is immediately allocated to something else and corrupted by the list_add. I think.

Anyway, does the attached help? Also, out of curiosity: why is your server getting restarted so often?
Comment 8 Ben Greear 2011-06-29 22:39:27 UTC
Yes, that fixes it. That is the same functional change as what I posted in the original email, though your comments are much better:
http://www.spinics.net/lists/linux-nfs/msg22169.html

I've tested that fix in -rc4, so feel free to add my tested-by.
Comment 9 Ben Greear 2011-06-29 22:41:26 UTC
Sorry, forgot to answer your other question: We were restarting so often to test client side handling of server restarts. But, with 200+ active mounts doing steady O_DIRECT traffic, it only takes a few restarts to hit the crash on our server.
Comment 10 bfields 2011-06-29 22:44:00 UTC
Oh crap, I suck--I saw the email but must not have paged down to the bottom. I could've saved myself some trouble. Thanks for the report and testing.
Comment 11 Florian Mickler 2011-07-02 19:32:34 UTC
Comment 12 Rafael J. Wysocki 2011-08-14 19:10:36 UTC
Fixed by commit ebc63e531cc6a457595dd110b07ac530eae788c3.