With enough packet loss on the TCP connection between the NFS client code and an NFS server (using NFS v3), the RPC client code starts recovery by shutting down the connection and then re-establishing it. We then see repeated connection setups and teardowns without any intervening data packets:
4 42.909478 172.18.0.39 10.1.6.102 TCP 1013 > nfs [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSV=108490 TSER=0 WS=0
5 42.909577 10.1.6.102 172.18.0.39 TCP nfs > 1013 [SYN, ACK] Seq=0 Ack=1 Win=64240 Len=0 MSS=1460
6 42.909610 172.18.0.39 10.1.6.102 TCP 1013 > nfs [ACK] Seq=1 Ack=1 Win=5840 Len=0
7 42.909672 172.18.0.39 10.1.6.102 TCP 1013 > nfs [FIN, ACK] Seq=1 Ack=1 Win=5840 Len=0
8 42.909767 10.1.6.102 172.18.0.39 TCP nfs > 1013 [ACK] Seq=1 Ack=2 Win=64240 Len=0
9 43.660083 10.1.6.102 172.18.0.39 TCP nfs > 1013 [FIN, ACK] Seq=1 Ack=2 Win=64240 Len=0
10 43.660100 172.18.0.39 10.1.6.102 TCP 1013 > nfs [ACK] Seq=2 Ack=2 Win=5840 Len=0
The same sequence then repeats after a while.
Here's a link to what I think the problem is: http://lkml.org/lkml/2010/7/27/42
Essentially, tcp_sendmsg() bails out at this check because sk_shutdown still has SEND_SHUTDOWN set:
err = -EPIPE;
if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
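The user-visible side of that check can be seen from user space. The following is a minimal sketch, not code from the kernel or from the patch: the helper name send_errno_after_shutdown() is made up for illustration, and it uses a loopback TCP connection instead of NFS. It shuts down the sending side of a connected socket and then tries to send on it, which on Linux is expected to fail with EPIPE.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): connect over loopback, shut
 * down the sending side, then try to send one byte.
 * Returns the errno from send(), 0 if send() unexpectedly succeeded,
 * or -1 if the setup itself failed.
 */
int send_errno_after_shutdown(void)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	int result = -1;
	int lfd = -1, cfd = -1;

	lfd = socket(AF_INET, SOCK_STREAM, 0);
	if (lfd < 0)
		goto out;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;	/* let the kernel pick a free port */

	if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(lfd, 1) < 0 ||
	    getsockname(lfd, (struct sockaddr *)&addr, &len) < 0)
		goto out;

	cfd = socket(AF_INET, SOCK_STREAM, 0);
	if (cfd < 0)
		goto out;

	/* The handshake completes against the listen backlog; no accept() needed. */
	if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		goto out;

	/* Marks the sending side of the client socket as shut down. */
	if (shutdown(cfd, SHUT_WR) < 0)
		goto out;

	/* MSG_NOSIGNAL: report the failure via errno instead of SIGPIPE. */
	if (send(cfd, "x", 1, MSG_NOSIGNAL) < 0)
		result = errno;
	else
		result = 0;
out:
	if (cfd >= 0)
		close(cfd);
	if (lfd >= 0)
		close(lfd);
	return result;
}
```

Calling send_errno_after_shutdown() from a small main() and comparing the result with EPIPE is enough to see the behaviour. The difference in the NFS case is that the socket is reconnected after the shutdown, yet the send still fails.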
Here's a patch that fixes the hang. It clears the sk_shutdown flag at connection init time:
--- /home/company/software/src/linux-22.214.171.124/net/ipv4/tcp_output.c 2010-07-27 08:46:46.917000000 +0100
+++ net/ipv4/tcp_output.c 2010-07-27 09:19:16.000000000 +0100
@@ -2522,6 +2522,13 @@
struct tcp_sock *tp = tcp_sk(sk);
+ /* clear down any previous shutdown attempts so that
+ * reconnects on a socket that's been shutdown leave the
+ * socket in a usable state (otherwise tcp_sendmsg() returns
+ * -EPIPE).
+ */
+ sk->sk_shutdown = 0;
/* We'll fix this up when we get a response from the other end.
* See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT.
Whether that's the correct fix, I don't know.
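To make the intent of the change easier to follow, here is a toy user-space model of it. This is only a sketch under stated assumptions: struct toy_sock and all the toy_* functions are invented for illustration and are not kernel APIs, and the flag values are copied from memory of include/net/sock.h (the model does not depend on the exact numbers).

```c
#include <errno.h>

/* Stand-ins for the kernel flag definitions (assumed values). */
#define RCV_SHUTDOWN  1
#define SEND_SHUTDOWN 2

/* Toy socket: only the two fields the quoted check looks at. */
struct toy_sock {
	int sk_err;
	unsigned int sk_shutdown;
};

/* Mirrors the check quoted from tcp_sendmsg(): refuse to send after shutdown. */
int toy_sendmsg(const struct toy_sock *sk)
{
	if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
		return -EPIPE;
	return 0;
}

/* Models the RPC layer shutting down the sending side during recovery. */
void toy_shutdown_send(struct toy_sock *sk)
{
	sk->sk_shutdown |= SEND_SHUTDOWN;
}

/*
 * Models connection init on a reused socket. With 'patched' set, the
 * stale shutdown flags are cleared as in the proposed change.
 */
void toy_connect_init(struct toy_sock *sk, int patched)
{
	if (patched)
		sk->sk_shutdown = 0;
}

/* One shutdown / reconnect / send cycle; returns the send result. */
int toy_reconnect_then_send(int patched)
{
	struct toy_sock sk = { 0, 0 };

	toy_shutdown_send(&sk);
	toy_connect_init(&sk, patched);
	return toy_sendmsg(&sk);
}
```

In this model toy_reconnect_then_send(0) keeps returning -EPIPE on every cycle, which corresponds to the repeated setup and teardown in the trace, while toy_reconnect_then_send(1) returns 0.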
At the time of writing, the current state of the thread in the LKML is here: http://lkml.org/lkml/2010/7/29/120.
Please don't send patches via bugzilla - it causes lots of problems with
our usual patch management and review processes.
Please send this patch via email as per Documentation/SubmittingPatches.
Suitable recipients may be found via scripts/get_maintainer.pl. Please
also cc myself on the email.
For this one I'd suggest cc'ing email@example.com and firstname.lastname@example.org at least.
Patch submitted: http://lkml.org/lkml/2010/8/3/91
This problem also affects 2.6.32 and 2.6.30 series kernels. We have not seen such a hang with 2.6.26.
FWIW, I've found it easier to reproduce this problem with Ethernet flow control off, but it still happens with it on as well.
This is how I reproduce the problem.
If I run the following in four different xterm windows, each cd'ed into the same NFS-mounted directory:
xterm1: rm -rf *
xterm2: while true; do let iter+=1; echo $iter; dd if=/dev/zero of=$$ bs=1M count=1000; done
xterm3: while true; do let iter+=1; echo $iter; dd if=/dev/zero of=$$ bs=1M count=1000; done
xterm4: while true; do let iter+=1; echo $iter; dd if=/dev/zero of=$$ bs=1M count=1000; done
then it normally hangs before the third iteration starts. The directory contains a lot of data (e.g. five Linux source trees).
This happens with different types of Ethernet hardware too. The rm -rf isn't necessary but makes the problem easier to reproduce (for me anyway).
Changing to regression as this cannot be reproduced on 2.6.26 series kernels.
I've reproduced the problem on 126.96.36.199.
Created attachment 27384
Abort SUNRPC connection if it's still in shutdown state when it's reused.
The sk_shutdown flag was left set on the socket, so tcp_sendmsg() returned an error, which in turn made the RPC layer retry recovery. The patch detects this situation and aborts the connection if sk_shutdown is still set when the connection is being reused.
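As a rough illustration of the decision this patch makes, here is a toy sketch. It is not the SUNRPC code: struct toy_xprt, enum toy_action and toy_reuse_decision() are hypothetical names, and the model only captures the "abort if sk_shutdown is set on reuse" rule described above.

```c
/* Hypothetical outcome of trying to reuse an existing connection. */
enum toy_action {
	TOY_REUSE,			/* socket is clean, keep using it */
	TOY_ABORT_AND_RECONNECT		/* stale shutdown state, start over */
};

/* Toy transport: just the shutdown flags of the underlying socket. */
struct toy_xprt {
	unsigned int sk_shutdown;
};

/*
 * If any shutdown flag is still set when the connection is about to be
 * reused, abort it instead of sending on a socket that would only
 * return an error.
 */
enum toy_action toy_reuse_decision(const struct toy_xprt *xprt)
{
	if (xprt->sk_shutdown != 0)
		return TOY_ABORT_AND_RECONNECT;
	return TOY_REUSE;
}
```

Unlike the earlier tcp_output.c change, which clears the flag inside the TCP layer, this approach leaves TCP alone and has the RPC side discard the connection.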