Bug 11154 - NFS hangs with 2.6.25/2.6.26 despite server being reachable
Summary: NFS hangs with 2.6.25/2.6.26 despite server being reachable
Status: CLOSED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: NFS (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Trond Myklebust
URL:
Keywords:
Depends on:
Blocks: 7216
  Show dependency tree
 
Reported: 2008-07-24 02:57 UTC by walken
Modified: 2010-02-03 21:02 UTC (History)
11 users (show)

See Also:
Kernel Version: 2.6.25, 2.6.26, 2.6.27
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
sysrq-T trace 1 (28.92 KB, text/plain)
2008-07-24 02:58 UTC, walken
Details
sysrq-T trace 2 (27.95 KB, text/plain)
2008-07-24 02:59 UTC, walken
Details
alt-sysreq-T dump during NFS hang following resume-from-RAM (45.04 KB, text/plain)
2008-09-02 18:11 UTC, Chris Newbold
Details
tcpdump of client resume from S3, showing 5-minutes NFS hang. (42.81 KB, text/plain)
2008-10-21 05:28 UTC, walken
Details
[PATCH] SUNRPC: Respond promptly to server TCP resets (4.31 KB, patch)
2008-10-24 10:27 UTC, Trond Myklebust
Details | Diff
[PATCH] SUNRPC: Fix the setting of xprt->reestablish_timeout when reconnecting (2.54 KB, patch)
2008-10-24 10:28 UTC, Trond Myklebust
Details | Diff

Description walken 2008-07-24 02:57:51 UTC
Latest working kernel version: 2.6.24
Earliest failing kernel version: 2.6.25
Distribution: debian etch
Problem Description:

I'm getting frequent NFS hangs when running 2.6.25 or 2.6.26 on my NFS clients, while 2.6.24 seems to work fine. The hangs most commonly (but not always) happen after a suspend/resume cycle on the client. The client is pingable from the server during the hangs, but the clients can't access their NFS mounted rootfs. 2.6.25 based clients always recover after some time (varies from just seconds to 5 or 10 minutes). 2.6.26 based clients apparently never recover.

Here are a few more details about my setup:

* The NFS server is a linux host (x86-64) running 2.6.22.19;

* I have two diskless clients using different hardware, one is an old
  AMD K7 with intel pro/100 adapter, the other is x86-64 with on-board
  intel e1000e networking.

* The clients boot from a small initrd and mount their rootfs using
  NFS3 over TCP. The initial mounting is done with "nfsmount -o ro,nolock"
  from the klibc-utils package.

* /etc/fstab in the clients describe the nfs root as having the nolock,noatime
  options. The boot scripts (unmodified from my debian etch distro)
  remount the root using these. In the end, /proc/mounts shows the rootfs as
  using the following options:
  When running 2.6.24 on the client:
    rw,noatime,vers=3,rsize=131072,wsize=131072,hard,nointr,
    nolock,proto=tcp,timeo=7,retrans=3,sec=sys
  When running 2.6.26 on the client:
    rw,noatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,nointr,
    nolock,proto=tcp,timeo=7,retrans=3,sec=sys,mountproto=udp

* I often run suspend-to-ram on the clients, if I think they'll be unused
  for a few hours.

* The issue most often happens after a resume. The clients hang showing the classic symptoms of what would happen if they can not reach their NFS server: load goes to high values, processes can't access the filesystem and get 'stuck' when they try to do so. Some processes which don't access the filesystem, like xload for example, keep going and display high load values. Shell sessions usualy get stuck before I can do anything useful with them.

* I can log on the server and confirm that the clients are pingable from there during the times where they seem stuck on their NFS accesses.

* 2.6.24 based clients do not exhibit this failure. 2.6.25 based clients always recover after a while (can vary from few seconds to 10 minutes or so). 2.6.26 based clients don't seem to recover, as far as I could see (I only let this happen once on each of my clients, so far, and have gone back to 2.6.25 for further experiments).

* I don't suppose the failure to be related to ACPI suspend issues, given that I've observed it on both of my NFS clients with very different hardware.

* I do not fully understand which parameters trigger this failure. I can sometimes run 2.6.25 for a week without hitting it, and then hit it twice in a row the next day... Not a good candidate for a binary search, I think :/


On a suggestion from J. Bruce Fields, I obtained a couple sysrq-T traces of a client during the NFS freeze, attached below. I'm also open to other suggestions about information to collect - I am, however, not sure if I'll be able to try anything in a root shell session during the NFS hangs, given the very limited functionality of the clients during these... :/
Comment 1 walken 2008-07-24 02:58:57 UTC
Created attachment 16961 [details]
sysrq-T trace 1

First sysrq-T trace, on K7 based client.
Comment 2 walken 2008-07-24 02:59:35 UTC
Created attachment 16962 [details]
sysrq-T trace 2

Second sysrq-T trace, on K7 based client.
Comment 3 walken 2008-07-24 03:18:45 UTC
I just looked at Trond's bug pile and found out about bug 11061. This bug might
be a duplicate, please feel free to consolidate if that's the case.

Commonalities:
* NFS client hangs while the network connection is fine
* Things used to work fine with older kernel versions

Differences:
* In my case, 2.6.25 generally recovers after a while. I've also observed
failures with 2.6.26 though, and they never recovered, like what the other
poster is seeing.
* In my case the failure often occurs after a suspend/resume cycle. Not always
though, so the other poster might just observe the (rare for me) case where the
failure occurs without suspending.

As an observation, in my case I have seen failures at least as far back as
2.6.25.4 for sure (and I believe as far back as the original 2.6.25, but I'm
not certain about this point)

Hope this helps.
Comment 4 Trond Myklebust 2008-07-24 08:49:24 UTC
timeo=7 on a TCP connection is just not workable. If the timeout occurs in the
middle of a transmission, you force the client to disconnect from the server
and then reconnect. This is a _slow_ process on TCP. The other thing is that
you force another slow start...

Please leave timeo at the default of 600, and all will work fine.
Comment 5 walken 2008-07-24 14:54:48 UTC
I want to clarify that I did not set timeo=7 explicitly in either the nfsmount command arguments or the /etc/fstab file. I just tested that my regular mount command leaves timeo at 600 by default, so I suppose it's the nfsmount command (from klibc-utils) in my initrd that sets it to 7. I'll confirm that this evening, and try running with an explicit timeo=600 value.

(same with nointr and retrans=3 - I did not set them, looks like nfsmount did).
Comment 6 walken 2008-07-24 15:05:28 UTC
looking at the klibc-utils source (from debian etch), I can confirm that it defaults to NFS3 over TCP with timeo=7. Their code does that in nfsmount/main.c. If this is an issue you should consider telling that to klibc-utils maintainers.

As to my issue, I'll try with an explicit timeo=600 and see if that helps.
Note however that timeo=7 is also set when the clients are running 2.6.24, and they work fine in that configuration.
Comment 7 Trond Myklebust 2008-07-24 15:47:52 UTC
OK. That sounds like a definite klibc bug. I had the same issue occur with
am-utils a few years back, which resulted in the same symptoms that you are
reporting.

Apart from the fact that it results in significant slowdowns, the fact is that
low timeouts are unnecessary for the case of TCP. Unlike UDP, the TCP model
provides a guarantee that either the message will be delivered, or you will
lose the connection. The only reason why we have a timeout at all is because
some servers will continue to drop requests if they run out of resources, and
so the convention is that clients must retry (after a suitable delay to allow
the server time to recover) if that occurs.
Comment 8 walken 2008-07-28 02:57:33 UTC
Been running 2.6.25.12 with extra timeo=600,retrans=2 options passed to nfsmount. My rootfs mounting options are now rw,noatime,vers=3,rsize=131072,wsize=131072,hard,nointr,nolock,proto=tcp,timeo=600,retrans=2,sec=sys

I've seen some NFS freezes with that, though only relatively short ones yet (i.e. less than a minute). Still, I don't see such freezes when running 2.6.24 on the clients...
Comment 9 Chris Newbold 2008-09-02 18:06:26 UTC
I finally managed to get suspend-to-RAM working on my desktop machine (finally!) and am experiencing the same or a very similar problem when the machine is woken up. The network is up, the server is pingable but most desktop applications are wedged in disk wait, apparently because the NFS client believes the server to be unreachable.

After a minute or two, I get an "nfs server not responding" message and, after 5-10 minutes, connectivity is restored with a "server okay" message and then everything moves along.

Getting wedged in disk wait is simultaneous with the system waking up from suspend-to-RAM.

Client: Debian testing (lenny) 2.6.25-2-686
Server: Debian old stable (sarge) 2.6.8-3-686-smp

The client has a local disk which contains the root file system, but everything under /home is automounted from the server. Here are a typical set of mounts:

s8:/home/chris /home/chris nfs rw,vers=3,rsize=32768,wsize=32768,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=192.168.1.7 0 0
s8:/home/audio /home/audio nfs rw,vers=3,rsize=32768,wsize=32768,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=192.168.1.7 0 0
s8:/home/archive /home/archive nfs rw,vers=3,rsize=32768,wsize=32768,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=192.168.1.7 0 0

Here are some more or less interesting lines culled from /var/log/messages following a wake-up:

Sep  2 19:59:34 rs6 kernel: [125190.392055] Restarting tasks ... done.
Sep  2 19:59:33 rs6 kernel: [125190.849968] VFS: busy inodes on changed media.
Sep  2 19:59:40 rs6 kernel: [125197.672740] VFS: busy inodes on changed media.
Sep  2 20:02:32 rs6 kernel: [125372.133241] nfs: server s8 not responding, still trying
Sep  2 20:04:32 rs6 kernel: [125493.961810] nfs: server s8 OK

I'll also attach an alt-sysreq-T dump obtained during the interval where the network was clearly up but where NFS was unable to connect. The dump looks a little odd to me, as there are an awful lot of Epiphany browser processes and not much else. There is an automount process caught in disk wait. Running top or ps on the system shows many, many more processes than in the alt-sysreq-T dump, and many caught in disk wait.

I don't know enough about the alt-sysreq-T dumps and associated plumbing to understand the discrepancy.
Comment 10 Chris Newbold 2008-09-02 18:11:28 UTC
Created attachment 17584 [details]
alt-sysreq-T dump during NFS hang following resume-from-RAM
Comment 11 Chris Newbold 2008-09-02 18:16:08 UTC
I should add that this situation is 100% reproducible for me: there is always a 5-10 minute window following a resume-from-RAM where NFS is wedged.
Comment 12 walken 2008-09-03 10:14:14 UTC
Hi Chris,

As I noted above in the bug, I have a similar setup (desktop machine with suspend to ram) and hit this too (but, not very reproduceably in my case, and also my rootfs is on NFS which does not help with the investigation).

Could you try capturing network traces, on both the server and the client side, of what's going on during the resume ? i.e. on the client, before suspending, start:
tcpdump -nv host <server> > /tmp/clientside
and on the server, before resuming the client:
tcpdump -nv host <client> > /tmp/serverside
Then maybe we could compare and see what gets NFS confused ????

(I'm assuming that tcpdump will work better on your client than on mine, given that your rootfs is not NFS based)
Comment 13 Matt Johnston 2008-09-09 21:52:27 UTC
We have seen a similar regression in 2.6.25.16 and 2.6.26.2, with no hangs seen in 2.6.24.7. The NFS clients each mount around ~80 mountpoints spread over a dozen Linux 2.6.25 NFS servers, connected with 10 gigabit Myrinet (in ethernet mode). Occassionally (it might be correlated to NFS load) all the mounts from a particular NFS server on a NFS client will become unresponsive, even though the NFS server is still reachable with ping/ssh etc. This usually only affects one (or perhaps a couple) of the NFS clients. The others clients remain OK, and other mounts on the affected client are OK. After a period of time (5 minutes to perhaps 30 minutes) the connection will eventually return to normal.

Mount settings are:
echo1-myri:/data6 /data6 nfs rw,vers=3,rsize=1048576,wsize=1048576,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=140.0.40.1 0 0

Looking at NFS traffic on the client/server with tcpdump, there didn't seem to be any requests being sent to the server for an operation such as "ls /data6" on the mountpoint. The 2.6.24 kernel is running fine for now, though we can try some tests with later kernels if that would help.

NFS clients are dual quad-core Intel CPUs, x86_64, Fedora 9.
NFS servers are running Fedora 8, x86_64, 2.6.25.16.
Comment 14 walken 2008-10-20 19:31:52 UTC
I'm still seeing this with 2.6.27.1. So far with that kernel I've seen the freezes only following an S3 resume, and they have been resolving themselves after no more than 5 minutes. I'll update the bug if I see longer freezes.
Comment 15 walken 2008-10-21 05:27:18 UTC
I noticed that the freezes I see upon S3 resume using a 2.6.27.1 based client are *exactly* 5 minutes long. Also, a tcpdump from the server side (started a few seconds before resuming the client, and ended a few seconds after the end of the freezo) showed the following information:

* at 4:40:27, the client sent a few requests (from tcp ports 939 and 763) to the server's NFS port. The server answered each request with an RST packet.

* At 04:43:31, the server sent ten pings to the client (I ran the ping command on the server side), which the client answered.

* At 04:45:27 (exactly 5 minutes after the client requests were answered with RST packets), the client opens new TCP connections from its ports 939 and 763 to the server's NFS port. At this point the client seems to unfreeze and become usable again.

tcpdump log will be attached.

I can try to get a log of 2.6.24 too if that helps - my guess would be that the server was sending RST packets just the same, but the client was re-establishing the TCP connections right away.

Is the 5 minute delay after the connection receives an RST mandated by the NFS specification, or is it a bug ????

Hope this helps,
Comment 16 walken 2008-10-21 05:28:59 UTC
Created attachment 18394 [details]
tcpdump of client resume from S3, showing 5-minutes NFS hang.
Comment 17 walken 2008-10-23 04:59:03 UTC
The 5 minute delay matches XS_TCP_MAX_REEST_TO in net/sunrpc/xprtsock.c.
Comments there indicate that the reconnect delay is a feature.

Looks like the first time the server drops the TCP connection there is a delay of XS_TCP_INIT_REEST_TO = 3 seconds, then this doubles every time until it reaches the maximum of XS_TCP_MAX_REEST_TO = 5 minutes, or until the client decides to drop the connection by itself after an idle period of XS_IDLE_DISC_TO = 5 minutes (which is unlikely to happen for the nfsroot).

Note that the timeout doubles even if the last server RST is many hours ago and the socket did successfully reconnect in between.
Comment 18 Trond Myklebust 2008-10-24 10:27:14 UTC
Created attachment 18426 [details]
[PATCH] SUNRPC: Respond promptly to server TCP resets

If the server sends us an RST error while we're in the TCP_ESTABLISHED
state, then that will not result in a state change, and so the RPC client
ends up hanging forever.

We can intercept the reset by setting up an sk->sk_error_report callback,
which will then allow us to initiate a proper shutdown and retry...

We also make sure that if the send request receives an ECONNRESET, then we
shutdown too...
Comment 19 Trond Myklebust 2008-10-24 10:28:24 UTC
Created attachment 18427 [details]
[PATCH] SUNRPC: Fix the setting of xprt->reestablish_timeout when reconnecting

If the server aborts an established connection, then we should retry
connecting immediately. Since xprt->reestablish_timeout is not reset unless
we go through a TCP_FIN_WAIT1 state, we may end up waiting unnecessarily.
The fix is to reset xprt->reestablish_timeout in TCP_ESTABLISHED, and then
rely on the fact that we set it to non-zero values in all other cases when
the server closes the connection.

Also fix a race between xs_connect() and xs_tcp_state_change(). The
value of xprt->reestablish_timeout should be updated before we actually
attempt the connection.
Comment 20 Nick Loeve 2008-12-11 12:56:43 UTC
This bug was also reported in the Gentoo Bugzilla: http://bugs.gentoo.org/show_bug.cgi?id=244976
Comment 21 walken 2009-01-12 03:44:40 UTC
Just a reminder: trond's second patch has not made its way into 2.6.29-rc1 yet...
Comment 22 Ian Dall 2009-01-14 14:10:31 UTC
See also my patch from bug 11061 comment 19...
Comment 23 Brian King 2009-01-17 07:10:51 UTC
I was experiencing this problem with OpenSuSE 11.0 as well. After applying both of Trond's patches I no longer experience the hang.
Comment 24 Trond Myklebust 2010-02-03 21:02:39 UTC
Marking as CLOSED. If anyone sees a return of the above problem, then please reopen.

Note You need to log in before you can comment on or make changes to this bug.