Bug 13034
| Summary: | printk in xs_tcp_setup_socket needs rate limit ... and delay | | |
|---|---|---|---|
| Product: | File System | Reporter: | Petr Vandrovec (petr) |
| Component: | NFS | Assignee: | Trond Myklebust (trondmy) |
| Status: | CLOSED CODE_FIX | | |
| Severity: | normal | CC: | akpm, harsha, jdelvare |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.29-ge0724bf6e4 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
Attachments:
- SUNRPC: Fix the problem of EADDRNOTAVAIL syslog floods on reconnect
- RPC debug messages
Description
Petr Vandrovec
2009-04-07 03:14:27 UTC
No. This was done very deliberately, in order to catch all unforeseen errors so that we can figure out if (and how) we need to handle them. If this condition lasted for more than 60 seconds, then it is unlikely that delay+retry would help. Is this in any way reproducible? I'd love to investigate what happened.

Well, maybe a delay would not help it recover, but it would fix the problem that the box was unusable. No, I hit it only once, without doing anything special (I mounted a local filesystem, and while I was typing another command the screen suddenly exploded with xs_tcp_setup_socket messages). I've tried to reproduce it for the past half hour, but I cannot. The only thing I could reproduce is that, due to the number of NFS mounts I have (59), after 4 cycles of 'umount -a -t nfs; mount -a -t nfs' the kernel runs out of ports below 1024, and I get a rather unspecific 'mount.nfs: mount system call failed' (with no error in dmesg). I'm not sure whether these hundreds of connections in TIME_WAIT might have something to do with receiving EADDRNOTAVAIL - after all, I do not remember doing any NFS remounting yesterday... I'll be sure to tell you when I find a reproducible scenario.

I've just hit this bug with kernel 2.6.30-rc1-git3, doing "ls -l" in $HOME. It happens that in $HOME I have a link to an NFS-mounted directory, and I guess that "ls -l" checks whether links have a valid target, so it must have triggered the NFS call which in turn flooded my logs and killed my system.

It happened again, apparently today at 7:47. Fortunately this time I had the watchdog disabled, so the box is still in this state - except that I had to kill klogd as it filled my filesystem :-(

Apr 11 07:11:49 petr-dev3 -- MARK --
Apr 11 07:31:49 petr-dev3 -- MARK --
Apr 11 07:47:59 petr-dev3 kernel: xs_tcp_setup_socket: connect returned unhandled error -99
Apr 11 07:47:59 petr-dev3 last message repeated 66 times
Apr 11 07:47:59 petr-dev3 kernel: xs_tcocket: connect returned unhandled error -99

Unmounting all NFS filesystems except these two succeeded, and messages are still being logged, so I guess one of these is the filesystem which triggered the problem:

bugs:/bugs /bugs nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.17.4.29,mountvers=3,mountproto=tcp,addr=10.17.4.29 0 0
cm-home:/home /mts-cm/home nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.17.193.167,mountvers=3,mountproto=tcp,addr=10.17.193.167 0 0

As the /bugs mountpoint is responding while /mts-cm/home is not, I guess the latter is the problematic mountpoint...
There seems to be a TIME_WAIT connection to the server:

petr-dev3:/var/log# host -a cm-home
cm-home.eng.vmware.com A 10.17.193.167
petr-dev3:/var/log# netstat -atn |grep 193
tcp        0      0 10.20.86.93:673         10.17.193.167:2049      TIME_WAIT
petr-dev3:/var/log#

Despite millions of messages, almost no traffic is sent to 10.17.193.167 - apparently each time TIME_WAIT expires one connection attempt is made (which is immediately closed), without any effort to do anything useful:

petr-dev3:/var/log# tcpdump -nieth0 host 10.17.193.167
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
18:11:39.134210 IP 10.20.86.93.673 > 10.17.193.167.2049: Flags [S], seq 2851076836, win 5840, options [mss 1460,sackOK,TS val 33610789 ecr 0,nop,wscale 6], length 0
18:11:39.209652 IP 10.17.193.167.2049 > 10.20.86.93.673: Flags [S.], seq 2082338817, ack 2851076837, win 5640, options [mss 1410,nop,nop,sackOK,nop,wscale 3], length 0
18:11:39.209706 IP 10.20.86.93.673 > 10.17.193.167.2049: Flags [.], ack 1, win 92, length 0
18:11:39.209735 IP 10.20.86.93.673 > 10.17.193.167.2049: Flags [F.], seq 1, ack 1, win 92, length 0
18:11:39.210963 IP 10.17.193.167.2049 > 10.20.86.93.673: Flags [.], ack 2, win 707, length 0
18:11:39.284240 IP 10.17.193.167.2049 > 10.20.86.93.673: Flags [F.], seq 1, ack 2, win 707, length 0
18:11:39.284259 IP 10.20.86.93.673 > 10.17.193.167.2049: Flags [.], ack 2, win 92, length 0
18:12:39.284244 IP 10.20.86.93.673 > 10.17.193.167.2049: Flags [S], seq 2851082496, win 5840, options [mss 1460,sackOK,TS val 33616804 ecr 0,nop,wscale 6], length 0
18:12:39.359241 IP 10.17.193.167.2049 > 10.20.86.93.673: Flags [S.], seq 2098018817, ack 2851082497, win 5640, options [mss 1410,nop,nop,sackOK,nop,wscale 3], length 0
18:12:39.359313 IP 10.20.86.93.673 > 10.17.193.167.2049: Flags [.], ack 1, win 92, length 0
18:12:39.359343 IP 10.20.86.93.673 > 10.17.193.167.2049: Flags [F.], seq 1, ack 1, win 92, length 0
18:12:39.360770 IP 10.17.193.167.2049 > 10.20.86.93.673: Flags [.], ack 2, win 707, length 0
18:12:39.446366 IP 10.17.193.167.2049 > 10.20.86.93.673: Flags [F.], seq 1, ack 2, win 707, length 0
18:12:39.446392 IP 10.20.86.93.673 > 10.17.193.167.2049: Flags [.], ack 2, win 92, length 0
18:13:39.444218 IP 10.20.86.93.673 > 10.17.193.167.2049: Flags [S], seq 2851088156, win 5840, options [mss 1460,sackOK,TS val 33622820 ecr 0,nop,wscale 6], length 0
18:13:39.519733 IP 10.17.193.167.2049 > 10.20.86.93.673: Flags [S.], seq 2114018817, ack 2851088157, win 5640, options [mss 1410,nop,nop,sackOK,nop,wscale 3], length 0
18:13:39.519772 IP 10.20.86.93.673 > 10.17.193.167.2049: Flags [.], ack 1, win 92, length 0
18:13:39.519898 IP 10.20.86.93.673 > 10.17.193.167.2049: Flags [F.], seq 1, ack 1, win 92, length 0
18:13:39.521233 IP 10.17.193.167.2049 > 10.20.86.93.673: Flags [.], ack 2, win 707, length 0
18:13:39.594523 IP 10.17.193.167.2049 > 10.20.86.93.673: Flags [F.], seq 1, ack 2, win 707, length 0
18:13:39.594549 IP 10.20.86.93.673 > 10.17.193.167.2049: Flags [.], ack 2, win 92, length 0

During those 60 seconds several tens of thousands of messages about error -99 are logged. Unfortunately, thanks to this logging, it is next to impossible to obtain any backtraces.
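For context, the message that floods the log here is printed unconditionally from the TCP connect worker whenever connect() returns an errno the code does not recognize. A minimal sketch of how such a message could be throttled with the kernel's printk_ratelimit() helper follows - illustrative only, not the fix that was eventually merged, and the wrapper function name is hypothetical:

```c
#include <linux/kernel.h>

/*
 * Hedged sketch: throttle the "unhandled error" warning so a persistent
 * connect failure cannot flood the syslog.  The wrapper function name is
 * hypothetical; this is not the patch that went upstream.
 */
static void xs_report_unhandled_error(int status)
{
	/* printk_ratelimit() returns 0 once the default burst is exhausted */
	if (printk_ratelimit())
		printk(KERN_WARNING
		       "xs_tcp_setup_socket: connect returned unhandled error %d\n",
		       status);
}
```

A rate limit alone would keep the log usable, but as the rest of this thread shows, the more complete fix is to actually handle the EADDRNOTAVAIL case.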
The only thing I can provide is that rpciod/2 and updatedb.mlocate are both in the 'R' state:

<4>runnable tasks:
<4> task   PID   tree-key   switches   prio   exec-runtime   sum-exec   sum-sleep
<4>----------------------------------------------------------------------------------------------------------
<4>R rpciod/2   6550   14393364.736982   183489490   115   xs_tcp_setup_socket: connect returned unhandled error -99
<4>   14393364.736982   2524989.383678   333501347.862356
<4>xs_tcp_setup_socket: connect returned unhandled error -99
<4>xs_tcp_setup_socket: connect returned unhandled error -99
<4>R updatedb.mlocat   24174   14393423.841187   1668915496   120   14393423.841187   16836465.157977   263710.386847
<4>

I guess that updatedb.mlocate is the one which triggered the problem. Unfortunately, even with a 1MB printk buffer it is completely filled with garbage before I can capture its contents, so I cannot provide a backtrace for either of the processes :-(

FYI, this is a trace with rpc_debug set to 65535. The first part (with error 99) repeats several thousand times a second:

xs_tcp_setup_socket: connect returned unhandled error -99
RPC: 5256 xprt_connect_status: retrying
RPC: 5256 call_connect_status (status -11)
RPC: 5256 call_transmit (status 0)
RPC: 5256 xprt_prepare_transmit
RPC: 5256 rpc_xdr_encode (status 0)
RPC: 5256 marshaling UNIX cred ffff88021809f6c0
RPC: 5256 using AUTH_UNIX cred ffff88021809f6c0 to wrap rpc data
RPC: 5256 xprt_transmit(116)
RPC: xs_tcp_send_request(116) = -32
RPC: xs_tcp_state_change client ffff88021ec7c000...
RPC: state 7 conn 0 dead 0 zapped 1
RPC: disconnected transport ffff88021ec7c000
RPC: 5256 call_status (status -32)
RPC: 5256 call_bind (status 0)
RPC: 5256 call_connect xprt ffff88021ec7c000 is not connected
RPC: 5256 xprt_connect xprt ffff88021ec7c000 is not connected
RPC: 5256 sleep_on(queue "xprt_pending" time 4339130741)
RPC: 5256 added to queue ffff88021ec7c410 "xprt_pending"
RPC: 5256 setting alarm for 60000 ms
RPC: xs_connect delayed xprt ffff88021ec7c000 for 0 seconds
RPC: worker connecting xprt ffff88021ec7c000 to address: addr=10.17.193.167 port=0 proto=tcp
RPC: ffff88021ec7c000 connect status 99 connected 0 sock state 7
RPC: xs_tcp_state_change client ffff88021ec7c000...
RPC: state 7 conn 0 dead 0 zapped 1
RPC: disconnected transport ffff88021ec7c000
RPC: 5256 __rpc_wake_up_task (now 4339130741)
RPC: 5256 disabling timer
RPC: 5256 removed from queue ffff88021ec7c410 "xprt_pending"
RPC: __rpc_wake_up_task done

Once every 60 seconds connect returns EINPROGRESS, and in that case this is generated:

xs_tcp_setup_socket: connect returned unhandled error -99
RPC: 5256 xprt_connect_status: retrying
RPC: 5256 call_connect_status (status -11)
RPC: 5256 call_transmit (status 0)
RPC: 5256 xprt_prepare_transmit
RPC: 5256 rpc_xdr_encode (status 0)
RPC: 5256 marshaling UNIX cred ffff88021809f6c0
RPC: 5256 using AUTH_UNIX cred ffff88021809f6c0 to wrap rpc data
RPC: 5256 xprt_transmit(116)
RPC: xs_tcp_send_request(116) = -32
RPC: xs_tcp_state_change client ffff88021ec7c000...
RPC: state 7 conn 0 dead 0 zapped 1
RPC: disconnected transport ffff88021ec7c000
RPC: 5256 call_status (status -32)
RPC: 5256 call_bind (status 0)
RPC: 5256 call_connect xprt ffff88021ec7c000 is not connected
RPC: 5256 xprt_connect xprt ffff88021ec7c000 is not connected
RPC: 5256 sleep_on(queue "xprt_pending" time 4339130741)
RPC: 5256 added to queue ffff88021ec7c410 "xprt_pending"
RPC: 5256 setting alarm for 60000 ms
RPC: xs_connect delayed xprt ffff88021ec7c000 for 0 seconds
RPC: worker connecting xprt ffff88021ec7c000 to address: addr=10.17.193.167 port=0 proto=tcp
RPC: ffff88021ec7c000 connect status 115 connected 0 sock state 2
RPC: 5256 sync task going to sleep
RPC: xs_tcp_state_change client ffff88021ec7c000...
RPC: state 1 conn 0 dead 0 zapped 1
RPC: 5256 __rpc_wake_up_task (now 4339130748)
RPC: 5256 disabling timer
RPC: 5256 removed from queue ffff88021ec7c410 "xprt_pending"
RPC: __rpc_wake_up_task done
RPC: 5256 sync task resuming
RPC: 5256 xprt_connect_status: retrying
RPC: 5256 call_connect_status (status -11)
RPC: 5256 call_transmit (status 0)
RPC: 5256 xprt_prepare_transmit
RPC: 5256 rpc_xdr_encode (status 0)
RPC: 5256 marshaling UNIX cred ffff88021809f6c0
RPC: 5256 using AUTH_UNIX cred ffff88021809f6c0 to wrap rpc data
RPC: 5256 xprt_transmit(116)
RPC: xs_tcp_send_request(116) = -32
RPC: xs_tcp_state_change client ffff88021ec7c000...
RPC: state 4 conn 1 dead 0 zapped 1
RPC: 5256 call_status (status -32)
RPC: 5256 call_bind (status 0)
RPC: 5256 call_connect xprt ffff88021ec7c000 is not connected
RPC: 5256 xprt_connect xprt ffff88021ec7c000 is not connected
RPC: 5256 sleep_on(queue "xprt_pending" time 4339130748)
RPC: 5256 added to queue ffff88021ec7c410 "xprt_pending"
RPC: 5256 setting alarm for 60000 ms
RPC: 5256 sync task going to sleep
RPC: xs_tcp_state_change client ffff88021ec7c000...
RPC: state 5 conn 0 dead 0 zapped 1

Then it goes back to the repeated -99.

OK... It's very odd that netstat is reporting the socket as being in a TIME_WAIT state, given the above xs_tcp_state_change() messages that put it in TCP_CLOSE, TCP_ESTABLISHED, TCP_FIN_WAIT1, and later TCP_FIN_WAIT2... It is also odd that the above call to xs_tcp_send_request() is returning EPIPE, when xs_tcp_state_change() has just told us that the socket state is TCP_ESTABLISHED...

I just hit it again, running 2.6.30-rc2-git7. I also see a big slowdown under similar conditions when running 2.6.29.1, but without the log flood, so it is less problematic. Trond, what can be done to help troubleshoot this issue? The current situation is a pain; having to reboot each time the problem happens is really inconvenient. If you aren't going to rate-limit the printk, then you need to provide a fix sooner rather than later.

Created attachment 21074 [details]
SUNRPC: Fix the problem of EADDRNOTAVAIL syslog floods on reconnect
Please check if this fixes the problem...
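For readers following along, here is a rough sketch of the idea as it is described in this thread: when connect() fails with EADDRNOTAVAIL because the reused source port is still tied up (typically by the old connection sitting in TIME_WAIT), drop the current socket and reconnect from scratch instead of looping on the error. This is an illustration under those assumptions, not the literal contents of attachment 21074; xprt_force_disconnect() is an existing SUNRPC helper, while the function name and structure below are hypothetical.

```c
#include <linux/errno.h>
#include <linux/sunrpc/xprt.h>

/*
 * Hedged sketch of the reconnect handling discussed in this thread, NOT the
 * actual attachment/commit.  The function name is hypothetical; the real
 * logic lives in the TCP connect worker in net/sunrpc/xprtsock.c.
 */
static void xs_tcp_handle_connect_result(struct rpc_xprt *xprt, int status)
{
	switch (status) {
	case 0:
	case -EINPROGRESS:
	case -EALREADY:
		/* Connection is up or still being established: nothing to do. */
		break;
	case -EADDRNOTAVAIL:
		/*
		 * The reused source port is still busy, typically because the
		 * previous connection to the same server is in TIME_WAIT.
		 * Drop the current socket and let the transport reconnect
		 * with a fresh one instead of spinning on the error.
		 */
		xprt_force_disconnect(xprt);
		break;
	default:
		/* Other errors fall through to the normal retry path. */
		break;
	}
}
```

Whatever the exact shape of the real patch, the effect Petr observes below is the intended one: when the server closes an idle connection, the client simply comes back with a new socket instead of flooding the log.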
I'll test this patch, thanks. I'll let you know how it goes. As I tend to hit the bug every other day or so, it will take some time before I can say for sure that the bug is fixed (if it is).

It seems that it now uses a fresh new socket when the server closes the connection. Let's see whether it survives the night.

tcp        0      0 10.20.86.93:877         10.18.76.70:2049        ESTABLISHED
tcp        0      0 10.20.86.93:878         10.18.76.70:2049        TIME_WAIT
tcp        0      0 10.20.86.93:880         10.18.76.80:2049        ESTABLISHED
tcp        0      0 10.20.86.93:881         10.18.76.80:2049        TIME_WAIT
tcp        0      0 10.20.86.93:882         10.18.76.66:2049        ESTABLISHED
tcp        0      0 10.20.86.93:883         10.18.76.66:2049        TIME_WAIT
tcp        0      0 10.20.86.93:902         10.18.76.82:2049        ESTABLISHED
tcp        0      0 10.20.86.93:903         10.18.76.82:2049        TIME_WAIT
tcp        0      0 10.20.86.93:923         10.18.76.27:2049        ESTABLISHED
tcp        0      0 10.20.86.93:924         10.18.76.27:2049        TIME_WAIT
tcp        0      0 10.20.86.93:940         10.18.76.84:2049        ESTABLISHED
tcp        0      0 10.20.86.93:941         10.18.76.84:2049        TIME_WAIT
tcp        0      0 10.20.86.93:943         10.17.4.47:2049         ESTABLISHED
tcp        0      0 10.20.86.93:944         10.17.4.47:2049         TIME_WAIT

On Wed, 2009-04-22 at 09:48 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #10 from Petr Vandrovec <petr@vandrovec.name> 2009-04-22 09:48:01 ---
> It seems that now it uses new fresh socket when server closes connection.
> Let's see whether it survives night.
>
> tcp        0      0 10.20.86.93:877         10.18.76.70:2049        ESTABLISHED
> tcp        0      0 10.20.86.93:878         10.18.76.70:2049        TIME_WAIT
> tcp        0      0 10.20.86.93:880         10.18.76.80:2049        ESTABLISHED
> tcp        0      0 10.20.86.93:881         10.18.76.80:2049        TIME_WAIT
> tcp        0      0 10.20.86.93:882         10.18.76.66:2049        ESTABLISHED
> tcp        0      0 10.20.86.93:883         10.18.76.66:2049        TIME_WAIT
> tcp        0      0 10.20.86.93:902         10.18.76.82:2049        ESTABLISHED
> tcp        0      0 10.20.86.93:903         10.18.76.82:2049        TIME_WAIT
> tcp        0      0 10.20.86.93:923         10.18.76.27:2049        ESTABLISHED
> tcp        0      0 10.20.86.93:924         10.18.76.27:2049        TIME_WAIT
> tcp        0      0 10.20.86.93:940         10.18.76.84:2049        ESTABLISHED
> tcp        0      0 10.20.86.93:941         10.18.76.84:2049        TIME_WAIT
> tcp        0      0 10.20.86.93:943         10.17.4.47:2049         ESTABLISHED
> tcp        0      0 10.20.86.93:944         10.17.4.47:2049         TIME_WAIT

What does the equivalent netstat look like on the server side?

Unfortunately I have no access to the server (they are some NetApp and EMC storage devices maintained by company IT).

It seems that they are all configured the same way, so after mount they start timing out connections all at the same moment (after 10 minutes since mount, or something like that), and the netstat above was captured when I ran 'df' after the connections moved from ESTABLISHED to TIME_WAIT on the client side. The TIME_WAIT entries disappear after 60 seconds, as expected.

(Switching to bugzilla email interface, and ccing linux-nfs)

On Wed, 2009-04-22 at 20:49 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=13034
>
> --- Comment #12 from Petr Vandrovec <petr@vandrovec.name> 2009-04-22 20:49:58 ---
> Unfortunately I have no access to the server (they are some NetApp and EMC
> storage devices maintained by company IT).
>
> It seems that they are all configured same way, so after mount they start
> timing out connections all at the same moment (after 10 minutes since mount, or
> something like that), and netstat above is captured when I run 'df' after
> connections moved from established to time_wait on client side. That TIME_WAIT
> disappear after 60 seconds, as expected.

So these connections are basically timing out because the systems are idle?
(FYI: the NFS convention is that clients are supposed to close the TCP connection if it has been idle for 5 minutes, whereas servers usually close it if the client has been idle for 6 minutes)...

Reply-To: petr@vmware.com

Trond Myklebust wrote:
> (Switching to bugzilla email interface, and ccing linux-nfs)
>
> On Wed, 2009-04-22 at 20:49 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
>> http://bugzilla.kernel.org/show_bug.cgi?id=13034
>>
>> --- Comment #12 from Petr Vandrovec <petr@vandrovec.name> 2009-04-22 20:49:58 ---
>> Unfortunately I have no access to the server (they are some NetApp and EMC
>> storage devices maintained by company IT).
>>
>> It seems that they are all configured same way, so after mount they start
>> timing out connections all at the same moment (after 10 minutes since mount, or
>> something like that), and netstat above is captured when I run 'df' after
>> connections moved from established to time_wait on client side. That TIME_WAIT
>> disappear after 60 seconds, as expected.
>
> So these connections are basically timing out because the systems are
> idle? (FYI: the NFS convention is that clients are supposed to close the
> TCP connection if it has been idle for 5 minutes, whereas the servers
> usually close it if the client has been idle for 6 minutes)...

Yes, they are timing out because the system is idle. It seems that it behaves the way it should - it is the client who closes them, 5 minutes after I run 'df':

$ savedq=-1; df > /dev/null; while true; do q="`netstat -atn | grep 2049 | grep TIME_WAIT | wc -l`"; if [ "$q" != "$savedq" ]; then echo -n "$q "; date; savedq="$q"; fi; sleep 1; done
0 Wed Apr 22 14:59:10 PDT 2009
28 Wed Apr 22 15:04:10 PDT 2009
0 Wed Apr 22 15:05:08 PDT 2009

So as far as I can tell, with the patch everything works as expected.

Petr

On Wed, 2009-04-22 at 15:22 -0700, Petr Vandrovec wrote:
> Trond Myklebust wrote:
> > So these connections are basically timing out because the systems are
> > idle? (FYI: the NFS convention is that clients are supposed to close the
> > TCP connection if it has been idle for 5 minutes, whereas the servers
> > usually close it if the client has been idle for 6 minutes)...
>
> Yes, they are timing out because system is idle. It seems that it
> behaves way it should - it is client who closes them, 5 minutes after I
> run 'df':
Good! The 5-minute idle timeout is the one case where we don't care about
preserving the port number (because there are no outstanding NFS
requests to replay to the server).
Jean, are you seeing the same behaviour (i.e. errors only on idle
timeout), and is the fix working for you?
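The reasoning above is the key point: a reconnect only has to come from the same source port when there are outstanding requests whose retransmissions the server must be able to match in its duplicate request cache; after an idle disconnect there is nothing to replay, so any free port will do. A small sketch of that policy, using hypothetical names rather than the actual xprtsock.c structures:

```c
/*
 * Hedged illustration of the port-preservation policy described above.
 * The structure and function names are hypothetical, not SUNRPC API.
 */
struct reconnect_ctx {
	unsigned short last_srcport;   /* source port of the previous connection */
	unsigned int   outstanding;    /* RPC requests not yet answered */
};

/* Return the source port to bind(); 0 means "any free port is fine". */
static unsigned short pick_srcport(const struct reconnect_ctx *ctx)
{
	if (ctx->outstanding > 0)
		return ctx->last_srcport; /* replays must come from the same port */
	return 0;                         /* idle disconnect: nothing to replay */
}
```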
On Wed, 22 Apr 2009 20:52:57 -0400, Trond Myklebust wrote:
> Good! The 5minute idle timeout is the one case where we don't care about
> preserving the port number (because there are no outstanding NFS
> requests to replay to the server).
>
> Jean, are you seeing the same behaviour (i.e. errors only on idle
> timeout), and is the fix working for you?
I haven't hit the problem since I applied your patch, but that was only
yesterday, so it's too early to draw statistically valid conclusions.
It is highly possible that the bug was happening on idle timeout; I am
the only user on this machine and I access the NFS mounts only from
time to time.
Still working here after one week. Time to send the patch upstream?

I'd just like to hear an ACK from Jean too.

No problem here either; I'd say the patch is ripe for upstream.

Fixed in mainline as commit f75e6745aa3084124ae1434fd7629853bdaf6798: "SUNRPC: Fix the problem of EADDRNOTAVAIL syslog floods on reconnect". Marking bug as CLOSED. Please reopen if the problem reoccurs.

Created attachment 39382 [details]
RPC debug messages
Adding this attachment since the problem is reproducible under the fc11 kernel, version
2.6.30.10-2.4.fc11.x86_64.
This happens when a continuous cp operation is performed on an NFS mount point, once in a while when the NFS server is rebooted and comes back online. rpciod goes bonkers on this fc11 server, using 50% of the CPU, and the other 50% is used by the "cp" operation itself. But we can see on the terminal that I/O is not progressing any more.
Can be reproduced quite well.
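The reconnect loop eating a whole CPU is the other half of what the bug title asks for ("... and delay"): if every failed connection attempt is rescheduled immediately, rpciod just spins. A hedged sketch of the kind of capped exponential backoff a reconnect path can apply between attempts - the constants and names here are illustrative, not the actual xprtsock.c implementation:

```c
#include <linux/jiffies.h>
#include <linux/kernel.h>

/* Illustrative bounds; not the real SUNRPC reestablish-timeout constants. */
#define RECONNECT_INIT_DELAY	(3 * HZ)	/* first retry after ~3 seconds */
#define RECONNECT_MAX_DELAY	(5 * 60 * HZ)	/* never wait longer than 5 minutes */

struct reconnect_state {
	unsigned long delay;	/* current delay between attempts, in jiffies */
};

/*
 * Return how long to wait before the next connect attempt, doubling the
 * delay after each failure, capped at RECONNECT_MAX_DELAY.
 */
static unsigned long next_reconnect_delay(struct reconnect_state *st)
{
	unsigned long d = st->delay ? st->delay : RECONNECT_INIT_DELAY;

	st->delay = min(d << 1, (unsigned long)RECONNECT_MAX_DELAY);
	return d;
}
```

In the trace earlier in this report, xs_connect logs "delayed xprt ... for 0 seconds", which is why the error path was able to loop several thousand times a second.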
(In reply to comment #22)
> Created an attachment (id=39382) [details]
> RPC debug messages
>
> Adding this attachment since it is reproducible under fc11 kernel version
> 2.6.30.10-2.4.fc11.x86_64
>
> Happens when a continuous cp operation is performed on an NFS mount-point, once
> in a while when NFS server is rebooted and gets back online. rpciod goes
> bonkers on this fc11 server utilizing 50% cpu and other 50% is used by "cp"
> operation itself. But we could see on the terminal that i/o is not progressing
> any more.
>
> Can be reproduced quite well.

To be clear, I have looked at the clnt.c and xprt.c code, and the necessary changes which went into the 2.6.30 kernel are present in the fc11 kernel on this server.

Can this be reproduced with a recent kernel? FC11 community support expired more than 6 months ago, and 2.6.30 kernels are almost 2 years old.

(In reply to comment #24)
> Can this be reproduced with a recent kernel? FC11 community support expired
> more than 6 months ago, and 2.6.30 kernels are almost 2 years old.

I would try, but as far as I can see the sunrpc code base didn't change much between 2.6.30 and the recent 2.6.36.1, so I thought I would report it nevertheless. I will report back if I am able to reproduce it with a new kernel.