Bug 24302

Summary: Kernel crashes when repeatedly trying to mount nfs share that is failing
Product: File System
Reporter: Stefan Bader (stefan.bader)
Component: NFS
Assignee: Trond Myklebust (trondmy)
Status: CLOSED CODE_FIX
Severity: normal
CC: akpm, etienne.goyer, florian
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 2.6.32, 2.6.37-rc3
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  Screenshot of panic on 2.6.37-rc3
  System log with autofs debug turned on
  Patch to set the number of rpc commands right

Description Stefan Bader 2010-12-03 20:20:55 UTC
ARCH: x86_64 (not tested on i386)

Testcase:

On the server side, configure an export whose mount requests will be denied (I used sec=krb5, as that gives me a permission denied error)

--- server /etc/exports ---
/foo *(rw,krb5)

On the client side, autofs gets configured to try to automount server:/foo

--- client /etc/auto.master ---
/auto /etc/auto.auto

--- client /etc/auto.auto ---
foo -tcp server:/foo

[optional]
Stop the autofs service and start the automounter in the foreground (automount -d -f) to see the permission denied message.

Create a symlink pointing to the autofs mount point:
ln -s /auto/foo foo

The following is done in a loop as the crash sometimes takes longer than other times:

while true; do ls foo/; sleep 1; done

It seems to take longer on the newest kernel, but eventually there will be a panic.
Comment 1 Stefan Bader 2010-12-03 20:28:10 UTC
Created attachment 38992
Screenshot of panic on 2.6.37-rc3
Comment 2 Stefan Bader 2010-12-03 21:09:47 UTC
This may be a bit hidden above, but the bug already seems to be present in 2.6.32.
Comment 3 Stefan Bader 2010-12-03 22:07:57 UTC
Created attachment 39002
System log with autofs debug turned on

This is the syslog from a kernel with several debug facilities turned on (note that this is back on a 2.6.32-based kernel, as things seem to happen slightly quicker there).
Comment 4 Stefan Bader 2010-12-07 08:41:51 UTC
It seems I was looking at the wrong part. I had been tracing into the autofs code and really could not see anything going wrong there. Then someone mentioned that it might make a difference whether the mount fails because the server is unreachable or because it denies the mount for security reasons.

So I just tried the following:

In a loop:
- create the mount point directory
- try to mount the NFS share
- if that succeeds, unmount it (this never happens)
- remove the mount point directory

After a few iterations of this, I saw a crash. So the interaction with autofs is not needed at all; the problem comes down to this case. Next I will quickly test whether creating and removing the mount point is needed here.
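
The four steps above can be sketched as a shell function. This is only an illustration under assumptions, not the script that was actually used: the function name mount_retry_loop, the default mount point /mnt/nfs-test, and the MOUNT_CMD/UMOUNT_CMD overrides (there only so the loop can be dry-run without a server) are all made up here, and the tcp option simply mirrors the -tcp entry in the autofs map from the description.

```shell
# Placeholders; override them in the environment as needed.
SERVER_EXPORT=${SERVER_EXPORT:-server:/foo}
MOUNT_POINT=${MOUNT_POINT:-/mnt/nfs-test}
MOUNT_CMD=${MOUNT_CMD:-mount}      # hypothetical hook for dry runs
UMOUNT_CMD=${UMOUNT_CMD:-umount}   # hypothetical hook for dry runs

# Repeat the four steps $1 times and print how many mount attempts failed.
mount_retry_loop() {
    attempts=$1
    failed=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # 1. create the mount point directory
        mkdir -p "$MOUNT_POINT"
        # 2. try to mount the NFS share
        if $MOUNT_CMD -t nfs -o tcp "$SERVER_EXPORT" "$MOUNT_POINT"; then
            # 3. unmount again if the mount unexpectedly succeeded
            $UMOUNT_CMD "$MOUNT_POINT"
        else
            failed=$((failed + 1))
        fi
        # 4. remove the mount point directory
        rmdir "$MOUNT_POINT"
        i=$((i + 1))
    done
    echo "$failed"
}
```

Run as root against the denying export, a call such as "mount_retry_loop 100" would print the number of failed attempts if the kernel survives that long.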
Comment 5 Stefan Bader 2010-12-07 09:10:43 UTC
So the mount point can stay static. The problem is triggered by repeatedly trying to mount the NFS share.
Comment 6 Andrew Morton 2010-12-08 01:06:34 UTC
So this report should be reassigned to NFS?
Comment 7 Stefan Bader 2010-12-08 13:03:53 UTC
I would have done that, though from my latest findings it might be either NFS doing something incorrect or RPC implementing it badly.

By now I have tracked the issue down to the call to nfs_umount() that is made when nfs_walk_authlist() finds no authentication method that client and server have in common. Commenting out this call gets rid of the crash.

The nfs_umount() call creates an RPC client using UDP, submits a UMNT message to the server (which is advisory, so no result is expected) and tears down the client as soon as the message is sent.

What I saw in packet traces was that a single packet does come back from the server to the client RPC port, saying that the server was not listening anyway (ICMP destination port unreachable). I added some delay between sending the message and the teardown, and with that the RPC debug facility shows incoming data. Unfortunately the delay does not prevent the crash from happening, which makes me unsure whether the cause is this packet or something else. On the other hand, I believe RPC itself uses a UDP client to get a destination port. The only difference is that it sets a decoding function (thus stating that some reply is expected) while the NFS client does not set one.
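
One way to look for that packet while RPC debugging is enabled might look like the sketch below. It rests on assumptions: the function names umnt_reply_filter and watch_umnt_reply and the default host name "server" are invented for illustration, tcpdump and rpcdebug (from nfs-utils) must be installed, and both need root. The filter matches ICMP type 3 (destination unreachable) with code 3 (port unreachable).

```shell
# Print a pcap filter for ICMP "destination port unreachable"
# (type 3, code 3) packets sent by the host given as $1.
umnt_reply_filter() {
    echo "icmp[icmptype] = icmp-unreach and icmp[icmpcode] = 3 and src host $1"
}

# Set all sunrpc debug flags, wait for one matching packet,
# then clear the debug flags again. Needs root.
watch_umnt_reply() {
    server=${1:-server}   # placeholder host name
    rpcdebug -m rpc -s all
    tcpdump -n -i any -c 1 "$(umnt_reply_filter "$server")"
    rpcdebug -m rpc -c all
}
```

With "watch_umnt_reply server" running in one terminal, a failing mount attempt in another should produce the packet if the server behaves as described; the RPC debug output goes to the kernel log.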

But I guess reassignment to NFS would be best at this time. Maybe someone there can give me further hints.
Comment 8 Andrew Morton 2010-12-08 18:30:32 UTC
OK, I reassigned it to NFS.  If that was wrong then at least the NFS guys should be able to help point things in the right direction.
Comment 9 Trond Myklebust 2010-12-08 20:02:49 UTC
<Switching to email interface. Please do not edit the bugzilla entry
directly, since that will lose the above Cc information>


Chuck,

Stefan appears to be hitting a panic in the nfs_umount() call from
nfs_walk_authlist(). Can you take a look, please?

Cheers
  Trond
Comment 10 Chuck Lever 2010-12-08 20:36:56 UTC
On Dec 8, 2010, at 2:03 PM, Trond Myklebust wrote:

> Stefan appears to be hitting a panic in the nfs_umount() call from
> nfs_walk_authlist(). Can you take a look, please?

Recv'd.  I'll have a look.
Comment 11 Chuck Lever 2010-12-09 00:19:36 UTC
Hi Stefan-

Apologies in advance for the attachment.  There are a few other clean ups that can be done, but this seems to be the minimal fix.  Please try this and let us know if it addresses your panic.
Comment 12 Stefan Bader 2010-12-09 10:03:44 UTC
Created attachment 39542
Patch to set the number of rpc commands right

So simple and so annoying. I spent quite some time trying to understand what was going wrong but kept overlooking this little piece (as the crash did not happen every time, I was trying to chase a race)!

<some time spent banging my head on the keyboard later>

OK, I can confirm that this change fixes the observed crash. Thanks, Chuck! I am attaching a slightly modified version of the patch (I just changed some record-keeping formatting for the bug reference, added one link and added my Tested-by). Is it possible to use that for sending it upstream? It also contains the required magic to let it go into stable.
Comment 13 Florian Mickler 2011-01-24 12:50:56 UTC
The patch has been merged in 2.6.37-rc6:
commit 5b362ac3799ff4225c40935500f520cad4d7ed66
Author: Chuck Lever <chuck.lever@oracle.com>
Date:   Fri Dec 10 12:31:14 2010 -0500

    NFS: Fix panic after nfs_umount()