Bug 38302

Summary: NFS crash in un-modified 3.0.0-rc3+, list corruption.
Product: File System
Reporter: Maciej Rutecki (maciej.rutecki)
Component: NFS
Assignee: bfields
Status: CLOSED CODE_FIX
Severity: normal
CC: florian, greearb, maciej.rutecki, rjw, trondmy
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.0-rc3+
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 36912
Attachments: svcrpc: fix list-corrupting race on nfsd shutdown

Description Maciej Rutecki 2011-06-26 20:35:25 UTC
Subject    : NFS crash in un-modified 3.0.0-rc3+, list corruption.
Submitter  : Ben Greear <greearb@candelatech.com>
Date       : 2011-06-20 22:42
Message-ID : 4DFFCCDC.2070106@candelatech.com
References : http://marc.info/?l=linux-kernel&m=130860979610651&w=2

This entry is being used for tracking a regression from 2.6.39. Please don't
close it until the problem is fixed in the mainline.
Comment 1 Trond Myklebust 2011-06-26 22:13:07 UTC
Reassigning to Bruce, since this appears to be a server bug.
Comment 2 Ben Greear 2011-06-27 16:58:58 UTC
We hit a similar bug on a hacked 2.6.38.8 kernel, so I don't think
this is a recent regression.

Also, the file-IO traffic in this test case was 2k read/writes using O_DIRECT.
I'm not sure if that matters or not.
Comment 3 bfields 2011-06-28 19:18:30 UTC
How similar?  Do you have a backtrace?

Do you know a recent kernel on which this *didn't* occur?
Comment 4 Ben Greear 2011-06-28 19:26:35 UTC
I don't have useful info on the previous kernels for this particular issue, and haven't had time to test on older clean kernels.

I'm trying to debug some client side issues (in a patched kernel), and after I get that resolved, we can do some tests on older kernels for this particular issue.
Comment 5 Ben Greear 2011-06-29 16:42:41 UTC
I found my notes from our internal testing, and we verified that the crash happens in 2.6.39.1, so this is not a regression in 3.0.  We will test some older kernels when we have time.  If there are any particular ones of interest, please let me know.
Comment 6 bfields 2011-06-29 20:06:59 UTC
Hm, I wonder if I screwed this up in the supposed "cleanup" of commit f8c0d226fe
"svcrpc: simplify svc_close_all", which went into 2.6.37.
Comment 7 bfields 2011-06-29 21:03:04 UTC
Created attachment 63902 [details]
svcrpc: fix list-corrupting race on nfsd shutdown

Looking again, I think that just shuffled the problem around, and it was originally introduced way back in 2.6.29.  We didn't notice it because it's a relatively rare race, without consequences unless the memory allocated by this xprt is immediately allocated to something else and corrupted by the list_add.

I think.

Anyway, does the attached help?

Also, out of curiosity: why is your server getting restarted so often?
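
The failure mode described in comment 7 can be illustrated with a minimal userspace sketch in C. It is not the kernel code and not the attached patch: the names list_node, list_add_checked, list_del_node and simulate_shutdown are hypothetical, and setting the link pointers to NULL merely stands in for the memory being freed and reused by another owner. The checked insert refuses to link when the neighbours no longer point at each other. An unchecked insert would instead write into the reused memory, which is the corruption the comment describes. Unlinking before the free is shown only as the generic remedy for this class of bug, under the assumption that it is comparable to the real fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, loosely modeled on the kernel's list_head. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

static void list_init(struct list_node *head)
{
    assert(head != NULL);
    head->next = head;
    head->prev = head;
}

/*
 * Insert new_node right after head. Returns false, without linking,
 * if head and its successor no longer point at each other.
 */
static bool list_add_checked(struct list_node *new_node, struct list_node *head)
{
    struct list_node *prev = head;
    struct list_node *next = head->next;

    if (next->prev != prev || prev->next != next)
        return false;

    next->prev = new_node;
    new_node->next = next;
    new_node->prev = prev;
    prev->next = new_node;
    return true;
}

/* Unlink node from whatever list it is on. */
static void list_del_node(struct list_node *node)
{
    node->prev->next = node->next;
    node->next->prev = node->prev;
}

/*
 * Hypothetical shutdown scenario.
 * Returns 0 if the list is still consistent afterwards,
 * 1 if corruption is detected, -1 if the setup itself failed.
 */
static int simulate_shutdown(bool unlink_before_free)
{
    struct list_node ready_list;   /* stands in for a list of ready transports */
    struct list_node xprt_link;    /* stands in for the transport's list link */
    struct list_node later_entry;  /* a later, unrelated enqueue */

    list_init(&ready_list);
    if (!list_add_checked(&xprt_link, &ready_list))
        return -1;

    if (unlink_before_free)
        list_del_node(&xprt_link);

    /* Assumption: models "freed and reused", the new owner overwrites the link. */
    xprt_link.next = NULL;
    xprt_link.prev = NULL;

    /* The next insert walks through ready_list.next, which may be stale. */
    return list_add_checked(&later_entry, &ready_list) ? 0 : 1;
}
```

Calling simulate_shutdown(false) leaves the stale link on the list, so the later insert sees inconsistent neighbours. Calling simulate_shutdown(true) removes the link first, so the later insert only touches the list head.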
Comment 8 Ben Greear 2011-06-29 22:39:27 UTC
Yes, that fixes it.  That is the same functional change as what I posted in the original email, though your comments are much better:

http://www.spinics.net/lists/linux-nfs/msg22169.html

I've tested that fix in -rc4, so feel free to add my tested-by.
Comment 9 Ben Greear 2011-06-29 22:41:26 UTC
Sorry, forgot to answer your other question:

We were restarting so often to test client side handling of server restarts.

But, with 200+ active mounts doing steady O_DIRECT traffic, it only takes a few restarts to hit the crash on our server.
Comment 10 bfields 2011-06-29 22:44:00 UTC
Oh crap, I suck--I saw the email but must not have paged down to the bottom.  I could've saved myself some trouble.

Thanks for the report and testing.
Comment 11 Florian Mickler 2011-07-02 19:32:34 UTC
Patch: https://bugzilla.kernel.org/attachment.cgi?id=63902
Comment 12 Rafael J. Wysocki 2011-08-14 19:10:36 UTC
Fixed by commit ebc63e531cc6a457595dd110b07ac530eae788c3.