Bug 11154
| Summary: | NFS hangs with 2.6.25/2.6.26 despite server being reachable | | |
|---|---|---|---|
| Product: | File System | Reporter: | walken |
| Component: | NFS | Assignee: | Trond Myklebust (trondmy) |
| Status: | CLOSED CODE_FIX | | |
| Severity: | normal | CC: | alan, bfields, brking, chris, david, douglaswth, franco, ian, kernel, matt, walken |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.25, 2.6.26, 2.6.27 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
| Bug Depends on: | | | |
| Bug Blocks: | 7216 | | |
Attachments:

- sysrq-T trace 1
- sysrq-T trace 2
- alt-sysreq-T dump during NFS hang following resume-from-RAM
- tcpdump of client resume from S3, showing 5-minute NFS hang
- [PATCH] SUNRPC: Respond promptly to server TCP resets
- [PATCH] SUNRPC: Fix the setting of xprt->reestablish_timeout when reconnecting
Description
walken
2008-07-24 02:57:51 UTC
Created attachment 16961 [details]
sysrq-T trace 1
First sysrq-T trace, on K7 based client.
Created attachment 16962 [details]
sysrq-T trace 2
Second sysrq-T trace, on K7 based client.
I just looked at Trond's bug pile and found out about bug 11061. This bug might be a duplicate; please feel free to consolidate if that's the case.

Commonalities:

- NFS client hangs while the network connection is fine
- Things used to work fine with older kernel versions

Differences:

- In my case, 2.6.25 generally recovers after a while. I've also observed failures with 2.6.26 though, and they never recovered, like what the other poster is seeing.
- In my case the failure often occurs after a suspend/resume cycle. Not always though, so the other poster might just observe the (rare for me) case where the failure occurs without suspending.

As an observation, in my case I have seen failures at least as far back as 2.6.25.4 for sure (and I believe as far back as the original 2.6.25, but I'm not certain about this point). Hope this helps.

---

timeo=7 on a TCP connection is just not workable. If the timeout occurs in the middle of a transmission, you force the client to disconnect from the server and then reconnect. This is a _slow_ process on TCP. The other thing is that you force another slow start... Please leave timeo at the default of 600, and all will work fine.

---

I want to clarify that I did not set timeo=7 explicitly in either the nfsmount command arguments or the /etc/fstab file. I just tested that my regular mount command leaves timeo at 600 by default, so I suppose it's the nfsmount command (from klibc-utils) in my initrd that sets it to 7. I'll confirm that this evening, and try running with an explicit timeo=600 value. (Same with nointr and retrans=3 - I did not set them; it looks like nfsmount did.)

---

Looking at the klibc-utils source (from Debian etch), I can confirm that it defaults to NFSv3 over TCP with timeo=7. Their code does that in nfsmount/main.c. If this is an issue, you should consider telling the klibc-utils maintainers. As to my issue, I'll try with an explicit timeo=600 and see if that helps.
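For context on why timeo=7 is so aggressive: per nfs(5), the timeo mount option is measured in tenths of a second. A tiny illustrative helper (not part of any NFS tooling, just a sketch of the arithmetic):

```python
def timeo_to_seconds(timeo):
    """Convert an NFS 'timeo' mount option value (tenths of a
    second, per nfs(5)) into seconds."""
    return timeo / 10.0

# klibc-utils' nfsmount default vs. the kernel default for TCP mounts:
print(timeo_to_seconds(7))    # 0.7 s -- far too short for a TCP reconnect
print(timeo_to_seconds(600))  # 60 s  -- the default Trond recommends
```

So the initrd-supplied timeo=7 gives the client well under a second before it declares a timeout, while the normal default allows a full minute.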
Note however that timeo=7 is also set when the clients are running 2.6.24, and they work fine in that configuration.

---

OK. That sounds like a definite klibc bug. I had the same issue occur with am-utils a few years back, which resulted in the same symptoms that you are reporting. Apart from the fact that they result in significant slowdowns, low timeouts are unnecessary in the case of TCP. Unlike UDP, the TCP model provides a guarantee that either the message will be delivered, or you will lose the connection. The only reason why we have a timeout at all is that some servers will continue to drop requests if they run out of resources, and so the convention is that clients must retry (after a suitable delay to allow the server time to recover) if that occurs.

---

Been running 2.6.25.12 with extra timeo=600,retrans=2 options passed to nfsmount. My rootfs mounting options are now:

rw,noatime,vers=3,rsize=131072,wsize=131072,hard,nointr,nolock,proto=tcp,timeo=600,retrans=2,sec=sys

I've seen some NFS freezes with that, though only relatively short ones so far (i.e. less than a minute). Still, I don't see such freezes when running 2.6.24 on the clients...

---

I finally managed to get suspend-to-RAM working on my desktop machine (finally!) and am experiencing the same or a very similar problem when the machine is woken up. The network is up and the server is pingable, but most desktop applications are wedged in disk wait, apparently because the NFS client believes the server to be unreachable. After a minute or two, I get an "nfs server not responding" message and, after 5-10 minutes, connectivity is restored with a "server okay" message and then everything moves along. Getting wedged in disk wait is simultaneous with the system waking up from suspend-to-RAM.
Client: Debian testing (lenny) 2.6.25-2-686
Server: Debian old stable (sarge) 2.6.8-3-686-smp

The client has a local disk which contains the root file system, but everything under /home is automounted from the server. Here is a typical set of mounts:

s8:/home/chris /home/chris nfs rw,vers=3,rsize=32768,wsize=32768,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=192.168.1.7 0 0
s8:/home/audio /home/audio nfs rw,vers=3,rsize=32768,wsize=32768,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=192.168.1.7 0 0
s8:/home/archive /home/archive nfs rw,vers=3,rsize=32768,wsize=32768,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=192.168.1.7 0 0

Here are some more or less interesting lines culled from /var/log/messages following a wake-up:

Sep 2 19:59:34 rs6 kernel: [125190.392055] Restarting tasks ... done.
Sep 2 19:59:33 rs6 kernel: [125190.849968] VFS: busy inodes on changed media.
Sep 2 19:59:40 rs6 kernel: [125197.672740] VFS: busy inodes on changed media.
Sep 2 20:02:32 rs6 kernel: [125372.133241] nfs: server s8 not responding, still trying
Sep 2 20:04:32 rs6 kernel: [125493.961810] nfs: server s8 OK

I'll also attach an alt-sysreq-T dump obtained during the interval where the network was clearly up but where NFS was unable to connect. The dump looks a little odd to me, as there are an awful lot of Epiphany browser processes and not much else. There is an automount process caught in disk wait. Running top or ps on the system shows many, many more processes than in the alt-sysreq-T dump, and many caught in disk wait. I don't know enough about the alt-sysreq-T dumps and associated plumbing to understand the discrepancy.

Created attachment 17584 [details]
alt-sysreq-T dump during NFS hang following resume-from-RAM
I should add that this situation is 100% reproducible for me: there is always a 5-10 minute window following a resume-from-RAM where NFS is wedged.

---

Hi Chris,

As I noted above in the bug, I have a similar setup (desktop machine with suspend-to-RAM) and hit this too (but not very reproducibly in my case, and also my rootfs is on NFS, which does not help with the investigation). Could you try capturing network traces, on both the server and the client side, of what's going on during the resume? I.e. on the client, before suspending, start:

tcpdump -nv host <server> > /tmp/clientside

and on the server, before resuming the client:

tcpdump -nv host <client> > /tmp/serverside

Then maybe we could compare and see what gets NFS confused? (I'm assuming that tcpdump will work better on your client than on mine, given that your rootfs is not NFS based.)

---

We have seen a similar regression in 2.6.25.16 and 2.6.26.2, with no hangs seen in 2.6.24.7. The NFS clients each mount around 80 mountpoints spread over a dozen Linux 2.6.25 NFS servers, connected with 10-gigabit Myrinet (in ethernet mode). Occasionally (it might be correlated to NFS load) all the mounts from a particular NFS server on an NFS client will become unresponsive, even though the NFS server is still reachable with ping/ssh etc. This usually only affects one (or perhaps a couple) of the NFS clients. The other clients remain OK, and other mounts on the affected client are OK. After a period of time (5 minutes to perhaps 30 minutes) the connection will eventually return to normal. Mount settings are:

echo1-myri:/data6 /data6 nfs rw,vers=3,rsize=1048576,wsize=1048576,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=140.0.40.1 0 0

Looking at NFS traffic on the client/server with tcpdump, there didn't seem to be any requests being sent to the server for an operation such as "ls /data6" on the mountpoint.
The 2.6.24 kernel is running fine for now, though we can try some tests with later kernels if that would help. NFS clients are dual quad-core Intel CPUs, x86_64, Fedora 9. NFS servers are running Fedora 8, x86_64, 2.6.25.16.

---

I'm still seeing this with 2.6.27.1. So far with that kernel I've seen the freezes only following an S3 resume, and they have been resolving themselves after no more than 5 minutes. I'll update the bug if I see longer freezes.

---

I noticed that the freezes I see upon S3 resume using a 2.6.27.1-based client are *exactly* 5 minutes long. Also, a tcpdump from the server side (started a few seconds before resuming the client, and ended a few seconds after the end of the freeze) showed the following:

- At 04:40:27, the client sent a few requests (from TCP ports 939 and 763) to the server's NFS port. The server answered each request with an RST packet.
- At 04:43:31, the server sent ten pings to the client (I ran the ping command on the server side), which the client answered.
- At 04:45:27 (exactly 5 minutes after the client requests were answered with RST packets), the client opened new TCP connections from its ports 939 and 763 to the server's NFS port. At this point the client seemed to unfreeze and become usable again.

The tcpdump log will be attached. I can try to get a log of 2.6.24 too if that helps - my guess would be that the server was sending RST packets just the same, but the client was re-establishing the TCP connections right away. Is the 5-minute delay after the connection receives an RST mandated by the NFS specification, or is it a bug?

Hope this helps,

Created attachment 18394 [details]
tcpdump of client resume from S3, showing 5-minute NFS hang.
The 5-minute delay matches XS_TCP_MAX_REEST_TO in net/sunrpc/xprtsock.c. Comments there indicate that the reconnect delay is a feature.

It looks like the first time the server drops the TCP connection there is a delay of XS_TCP_INIT_REEST_TO = 3 seconds; this then doubles every time until it reaches the maximum of XS_TCP_MAX_REEST_TO = 5 minutes, or until the client decides to drop the connection by itself after an idle period of XS_IDLE_DISC_TO = 5 minutes (which is unlikely to happen for the nfsroot). Note that the timeout doubles even if the last server RST was many hours ago and the socket did successfully reconnect in between.

Created attachment 18426 [details]
[PATCH] SUNRPC: Respond promptly to server TCP resets
If the server sends us an RST error while we're in the TCP_ESTABLISHED
state, then that will not result in a state change, and so the RPC client
ends up hanging forever.
We can intercept the reset by setting up an sk->sk_error_report callback,
which will then allow us to initiate a proper shutdown and retry...
We also make sure that if the send request receives an ECONNRESET, then we
shutdown too...
Created attachment 18427 [details]
[PATCH] SUNRPC: Fix the setting of xprt->reestablish_timeout when reconnecting
If the server aborts an established connection, then we should retry
connecting immediately. Since xprt->reestablish_timeout is not reset unless
we go through a TCP_FIN_WAIT1 state, we may end up waiting unnecessarily.
The fix is to reset xprt->reestablish_timeout in TCP_ESTABLISHED, and then
rely on the fact that we set it to non-zero values in all other cases when
the server closes the connection.
Also fix a race between xs_connect() and xs_tcp_state_change(). The
value of xprt->reestablish_timeout should be updated before we actually
attempt the connection.
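To make the mechanism discussed in this bug concrete, here is a small Python model (an illustration only, not the kernel code) of the reconnect backoff in net/sunrpc/xprtsock.c and of the behavioral change these patches aim for. The constants mirror XS_TCP_INIT_REEST_TO and XS_TCP_MAX_REEST_TO; before the fix the backoff was not reset when the connection was successfully re-established, so a single server RST after hours of healthy traffic could still incur the full 5-minute wait:

```python
XS_TCP_INIT_REEST_TO = 3    # seconds; initial reconnect delay
XS_TCP_MAX_REEST_TO = 300   # seconds; the 5-minute cap seen in the traces

def reconnect_delays(drops):
    """Delay applied after each successive server-side connection drop
    when the backoff is never reset: doubling, capped at 5 minutes."""
    delays, delay = [], XS_TCP_INIT_REEST_TO
    for _ in range(drops):
        delays.append(delay)
        delay = min(delay * 2, XS_TCP_MAX_REEST_TO)
    return delays

def delay_after_abort(previous_delay, reestablished, patched):
    """Delay before reconnecting after a server abort (simplified model).

    previous_delay: last value of xprt->reestablish_timeout in effect.
    reestablished:  socket reached TCP_ESTABLISHED since the last abort.
    patched:        with the fix, TCP_ESTABLISHED resets the backoff,
                    so the client retries connecting immediately.
    """
    if reestablished and patched:
        return 0
    return min(max(previous_delay, XS_TCP_INIT_REEST_TO) * 2,
               XS_TCP_MAX_REEST_TO)

print(reconnect_delays(8))  # -> [3, 6, 12, 24, 48, 96, 192, 300]
```

In this model, an unpatched client whose backoff had previously grown to 300 s waits the full 5 minutes after the next RST even if the connection had been healthy in between (`delay_after_abort(300, True, False)` is 300), while a patched client retries at once.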
This bug was also reported in the Gentoo Bugzilla: http://bugs.gentoo.org/show_bug.cgi?id=244976

---

Just a reminder: Trond's second patch has not made its way into 2.6.29-rc1 yet... See also my patch from bug 11061 comment 19...

---

I was experiencing this problem with OpenSuSE 11.0 as well. After applying both of Trond's patches I no longer experience the hang.

---

Marking as CLOSED. If anyone sees a return of the above problem, then please reopen.