Bug 15973 - NFS hangs when the user kerberos ticket is refreshed
Summary: NFS hangs when the user kerberos ticket is refreshed
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: NFS (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Trond Myklebust
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-05-14 06:31 UTC by Timo Aaltonen
Modified: 2012-07-11 16:18 UTC (History)
2 users (show)

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
wireshark dump (296.13 KB, application/octet-stream)
2010-05-14 06:31 UTC, Timo Aaltonen
Details
trace from /var/log/messages (52 bytes, text/plain)
2010-05-14 06:36 UTC, Timo Aaltonen
Details
a new trace from 2.6.34 kernel (377.58 KB, text/plain)
2010-06-03 16:19 UTC, Timo Aaltonen
Details
network dump with 2.6.35.1 (456.09 KB, application/octet-stream)
2010-08-19 09:05 UTC, Timo Aaltonen
Details

Description Timo Aaltonen 2010-05-14 06:31:20 UTC
Created attachment 26373 [details]
wireshark dump

This had been originally reported on the nfsv4 mailing list earlier this year. But now it seems there are two bugs, or at least somewhat different symptoms. Originally, letting the user ticket expire and then unlocking the screen would result in all access to the $HOME get stuck. But these logs were from a slightly different situation; access to $HOME worked from the ssh session which refreshed the ticket, but logging out the desktop session hung.

So what I did:

-log in the desktop
-kinit a short-lived ticket
-locked the screen
-let the ticket expire
-check out the situation as root, load is 0 and everything seems normal (apart from the log spam about the expired ticket when apps try to access $HOME)
-log in remotely with ssh, things seemed to work ok
-unlocked the desktop

See these snippets from the logs, and the timestamps:

/var/log/auth.log:
May 13 18:03:26 nexus6 sshd[25677]: pam_krb5(sshd:auth): user tjaalton authenticated as tjaalton@ORG.AALTO.FI
May 13 18:03:26 nexus6 sshd[25676]: Accepted keyboard-interactive/pam for tjaalton from 130.233.200.112 port 49284 ssh2
May 13 18:03:26 nexus6 sshd[25676]: pam_unix(sshd:session): session opened for user tjaalton by (uid=0)
May 13 18:03:40 nexus6 gnome-screensaver-dialog: pam_krb5(gnome-screensaver:auth): user tjaalton authenticated as tjaalton@ORG.AALTO.FI
May 13 18:03:40 nexus6 gnome-keyring-daemon[31670]: couldn't lock store file: /u/00/tjaalton/unix/.gnome2/keyrings/user.keystore: Input/output error
May 13 18:03:40 nexus6 gnome-keyring-daemon[31670]: couldn't initialize slot with master password: The operation failed
May 13 18:03:40 nexus6 gnome-screensaver-dialog: gkr-pam: unlocked login keyring

/var/log/messages:
May 13 18:03:40 nexus6 kernel: [539392.490985] Error: state manager failed on NFSv4 server home with error 2
.
.

State manager errors flooded the logs right after I unlocked the desktop. Then I got a trace which is attached, and also the dump from wireshark. These SM errors were new, on previous attempts accessing the shares would just hang. In both cases the network dumps look similar though, and load keeps rising due to the hung I/O.

I'll reproduce to see which one happens this time..

client: Ubuntu 10.04 with 2.6.32.11 based kernel, nfs-utils 1.2.2-1 from Debian
server: NetApp ONTAP 7.3.2
KDC: W2008R2, the client rpc.gssd is run with the '-n' option to use the machine ticket created with 'kinit -k HOST$' (and updated every three hours with cron).
Comment 1 Timo Aaltonen 2010-05-14 06:36:14 UTC
Created attachment 26374 [details]
trace from /var/log/messages
Comment 2 Timo Aaltonen 2010-05-14 06:38:10 UTC
Neither NFS_V4_1 nor NFS_FSCACHE are enabled in the client kernel conf, if it matters.
Comment 3 Timo Aaltonen 2010-05-18 07:43:19 UTC
Tried to reproduce, and the exact same thing happened.
Comment 4 Timo Aaltonen 2010-06-03 16:19:16 UTC
Created attachment 26635 [details]
a new trace from 2.6.34 kernel
Comment 5 Timo Aaltonen 2010-08-17 11:03:59 UTC
nothing changed with 2.6.35...

using v3 w. krb5 now, seems solid enough.
Comment 6 Timo Aaltonen 2010-08-19 09:05:22 UTC
Created attachment 27504 [details]
network dump with 2.6.35.1

This is the network dump from a machine in a hung state showing the client trying to lock files and a lot of BAD_STATEID replies from the server (netapp). I suspect these are pending file access attempts started when the ticket was expired, since gssd is generating lots of debug output about failing to create krb5 context for the user.

Or the locks were acquired when the ticket was valid, and renewing it fails because the server has cleared expired state information?? The netapp has
"options.nfs.v4.state_recycle_timeout=45" (default, undocumented value), dunno if it might have something to do with this..
Comment 7 Timo Aaltonen 2010-11-23 08:08:40 UTC
Testing 2.6.37rc2 shows that this seems to be fixed there. I'm not sure which commits did it, but suspect it's one of these or both:

      NFSv4: Don't call nfs4_reclaim_complete() on receiving NFS4ERR_STALE_CLIENTID
      NFSv4: The state manager must ignore EKEYEXPIRED.

Any chance of getting these in 2.6.32.x?

Note You need to log in before you can comment on or make changes to this bug.