Bug 24302
| Summary: | Kernel crashes when repeatedly trying to mount nfs share that is failing | | |
|---|---|---|---|
| Product: | File System | Reporter: | Stefan Bader (stefan.bader) |
| Component: | NFS | Assignee: | Trond Myklebust (trondmy) |
| Status: | CLOSED CODE_FIX | | |
| Severity: | normal | CC: | akpm, etienne.goyer, florian |
| Priority: | P1 | | |
| Hardware: | x86-64 | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.32, 2.6.37-rc3 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | Screenshot of panic on 2.6.37-rc3; System log with autofs debug turned on; Patch to set the number of rpc commands right | | |
Description
Stefan Bader
2010-12-03 20:20:55 UTC
Created attachment 38992 [details]
Screenshot of panic on 2.6.37-rc3
This is maybe a bit hidden but this bug seems to be present already in 2.6.32.
Created attachment 39002 [details]
System log with autofs debug turned on
This is the syslog from a kernel with several debug facilities turned on (note that this is back to a 2.6.32 based kernel, as things seem to happen slightly quicker there).
It seems I was looking at the wrong part. I had been tracing into the autofs code and really could not see things going wrong there. Then someone mentioned that it might make a difference whether the mount fails because the server is unreachable or because it denies the mount for security reasons. So I just tried the following in a loop:
- create the mount point directory
- try to mount the nfs share
- in case that succeeds, unmount it (never happens)
- remove the mount point directory
After a few iterations of this loop, I saw a crash. So the whole interaction with autofs is not needed either. It is down to this case. Next I will give it a quick test to see whether creating and removing the mount point is needed here.
So the mount point can stay static. The problem is triggered by repeatedly trying to mount the nfs share.
So this report should be reassigned to NFS?
I would have done that. Though from my latest findings it might be either nfs doing something incorrect or rpc implementing it badly. By now I have tracked the issue down to the call to nfs_umount() when nfs_walk_authlist() finds no authentication method in common between client and server. Commenting out this call gets rid of the crash.
What the nfs_umount() call does is create an rpc client using UDP, submit a UMNT message to the server (which is advisory and for which no result is expected) and, as soon as it is sent, tear down the client. What I saw in tcp traces was that there actually is a single packet going back from the server to the client RPC port saying the server was not listening anyway (ICMP destination port unreachable). I added some delay between sending the message and the teardown, and this shows incoming data in the RPC debug facility. Unfortunately it does not prevent the crash from happening, which makes me unsure whether it is this packet or something else. On the other hand I believe RPC itself is using a UDP client to get a destination port.
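The reproduction loop described above (create the mount point, try the mount, unmount on success, remove the mount point) could be scripted roughly as follows. This is only a sketch, not the script actually used for this report: the function name, the share `server:/export` and the mount point path are hypothetical, and it assumes the share is one the server refuses to mount. It needs root to do anything real, and on an affected kernel it may crash the machine.

```python
import os
import subprocess


def try_mount_loop(share, mount_point, attempts, run=subprocess.run):
    """Try to mount `share` repeatedly; return the number of failed attempts.

    `run` is injectable so the loop can be exercised without touching
    the real mount/umount commands.
    """
    failures = 0
    for _ in range(attempts):
        # Step 1: create the mount point directory.
        os.makedirs(mount_point, exist_ok=True)
        # Step 2: try to mount the nfs share.
        result = run(["mount", "-t", "nfs", share, mount_point])
        if result.returncode == 0:
            # Step 3: unmount again if the mount unexpectedly succeeded.
            run(["umount", mount_point])
        else:
            failures += 1
        # Step 4: remove the mount point directory.
        os.rmdir(mount_point)
    return failures


# Example call (hypothetical names, run as root on a test machine only):
# try_mount_loop("server:/export", "/mnt/nfstest", attempts=50)
```

According to the later finding in this thread, steps 1 and 4 can be dropped: the mount point may stay static and the repeated failing mount alone is enough.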
The only difference is that RPC's own UDP client sets a decoding function (thus stating there is some return) while the nfs client does not set one. But I guess re-assignment to NFS would be best at this time. Maybe someone there can give me further hints.
OK, I reassigned it to NFS. If that was wrong then at least the NFS guys should be able to help point things in the right direction.
On Wed, 2010-12-08 at 18:30 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=24302
>
> Andrew Morton <akpm@linux-foundation.org> changed:
>
> What |Removed |Added
> ----------------------------------------------------------------------------
> Component|Other |NFS
> AssignedTo|fs_other@kernel-bugs.osdl.org |trond.myklebust@fys.uio.no
>
> --- Comment #8 from Andrew Morton <akpm@linux-foundation.org> 2010-12-08 18:30:32 ---
> OK, I reassigned it to NFS. If that was wrong then at least the NFS guys
> should be able to help point things in the right direction.
<Switching to email interface. Please do not edit the bugzilla entry directly, since that will lose the above Cc information>
Chuck,
Stefan appears to be hitting a panic in the nfs_umount() call from nfs_walk_authlist(). Can you take a look, please?
Cheers
Trond
On Dec 8, 2010, at 2:03 PM, Trond Myklebust wrote:
> Chuck,
>
> Stefan appears to be hitting a panic in the nfs_umount() call from
> nfs_walk_authlist(). Can you take a look, please?
Recv'd. I'll have a look.
Hi Stefan-
On Dec 8, 2010, at 3:35 PM, Chuck Lever wrote:
> Recv'd. I'll have a look.
Apologies in advance for the attachment. There are a few other clean ups that can be done, but this seems to be the minimal fix. Please try this and let us know if it addresses your panic.
Created attachment 39542 [details]
Patch to set the number of rpc commands right
So simple and so annoying. I spent quite some time trying to understand what was going wrong but overlooked this little piece (as the crash did not happen every time, I was trying to chase a race)!
<some time spent banging my head on the keyboard later>
Ok, I can confirm that this change fixes the crash observed. Thanks, Chuck! I am attaching a slightly modified version of the patch (just changed some record-keeping formatting for the bug reference, added one link and added my Tested-by). Is it possible to use that for sending it upstream? It also contains the required magic to let it go into stable.
The patch has been merged in .37-rc6:
commit 5b362ac3799ff4225c40935500f520cad4d7ed66
Author: Chuck Lever <chuck.lever@oracle.com>
Date: Fri Dec 10 12:31:14 2010 -0500
    NFS: Fix panic after nfs_umount()
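For readers wondering what kind of mistake "the number of rpc commands" refers to: as far as I understand the upstream commit description, the mount client declared two procedures (MNT and UMNT), while its procedure table is indexed by procedure number and therefore has four slots. Per-procedure statistics sized by the declared count were then written past the end of the array when UMNT was called. The Python model below only illustrates that assumption; it is not the kernel code. All function names are hypothetical, and Python raises an IndexError where the kernel would silently overwrite neighbouring memory.

```python
# Procedure numbers as defined by the MOUNT protocol (MNT = 1, UMNT = 3).
MOUNTPROC_MNT = 1
MOUNTPROC_UMNT = 3


def build_proc_table():
    """Sparse procedure table indexed by procedure number (hypothetical model)."""
    table = [None] * (MOUNTPROC_UMNT + 1)
    table[MOUNTPROC_MNT] = "MNT"
    table[MOUNTPROC_UMNT] = "UMNT"
    return table


def defined_procs(table):
    """Number of procedures actually defined: the wrong value to size stats by."""
    return sum(1 for entry in table if entry is not None)


def make_stats(nrprocs):
    """Per-procedure call counters, one slot per declared procedure."""
    return [0] * nrprocs


def count_call(stats, proc):
    """Record one call; raises IndexError if proc does not fit into stats."""
    stats[proc] += 1


table = build_proc_table()
too_small = make_stats(defined_procs(table))  # 2 slots: indices 0 and 1
big_enough = make_stats(len(table))           # 4 slots: indices 0 to 3

count_call(too_small, MOUNTPROC_MNT)          # index 1 still fits
count_call(big_enough, MOUNTPROC_UMNT)        # index 3 fits
# count_call(too_small, MOUNTPROC_UMNT) would raise IndexError here.
```

Under this reading, only UMNT falls outside the undersized array, which would match the observation that the crash needs the failing-mount path where nfs_umount() is called.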