Bug 16494

Summary: NFS client over TCP hangs due to packet loss
Product: Networking Reporter: andyc.bluearc
Component: IPV4 Assignee: Stephen Hemminger (stephen)
Severity: normal CC: akpm, alan
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: Tree: Mainline
Regression: Yes
Attachments: Abort SUNRPC connection if it's still in shutdown state when it's reused.

Description andyc.bluearc 2010-08-02 16:14:43 UTC
If there is enough packet loss on a TCP connection between the NFS client code and an NFS server (using NFS v3), the RPC client code starts recovery by shutting down the connection and then re-establishing it. We then see repeated connection setups and teardowns without any intervening data packets:

4	42.909478	TCP	1013 > nfs [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSV=108490 TSER=0 WS=0
5	42.909577	TCP	nfs > 1013 [SYN, ACK] Seq=0 Ack=1 Win=64240 Len=0 MSS=1460
6	42.909610	TCP	1013 > nfs [ACK] Seq=1 Ack=1 Win=5840 Len=0
7	42.909672	TCP	1013 > nfs [FIN, ACK] Seq=1 Ack=1 Win=5840 Len=0
8	42.909767	TCP	nfs > 1013 [ACK] Seq=1 Ack=2 Win=64240 Len=0
9	43.660083	TCP	nfs > 1013 [FIN, ACK] Seq=1 Ack=2 Win=64240 Len=0
10	43.660100	TCP	1013 > nfs [ACK] Seq=2 Ack=2 Win=5840 Len=0

This sequence then repeats after a while.

Here's a link to what I think the problem is: http://lkml.org/lkml/2010/7/27/42

Essentially, tcp_sendmsg is breaking out here as sk_shutdown contains SEND_SHUTDOWN:

         err = -EPIPE;
         if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
                 goto out_err;
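
For anyone who wants to see the same -EPIPE path from userspace, here is a minimal sketch. It assumes a Linux host with loopback TCP available, and the helper name errno_of_send_after_shutdown() is made up for illustration; it is not part of the kernel or of any patch in this report. It shuts down the send side of a connected TCP socket, which sets SEND_SHUTDOWN, and then calls send(), which should fail with EPIPE:

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (illustration only): returns the errno reported by
 * send() on a loopback TCP socket whose send side has been shut down,
 * 0 if send() unexpectedly succeeds, or -1 if the setup itself fails. */
int errno_of_send_after_shutdown(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int result = -1;
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    int cfd = socket(AF_INET, SOCK_STREAM, 0);

    if (lfd < 0 || cfd < 0)
        goto out;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */

    /* Listen on loopback and connect to ourselves; the connection
     * completes from the listen backlog, so no accept() is needed. */
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &len) < 0 ||
        connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;

    /* Shutting down the send side sets SEND_SHUTDOWN on the socket. */
    if (shutdown(cfd, SHUT_WR) < 0)
        goto out;

    /* MSG_NOSIGNAL makes send() report EPIPE instead of raising SIGPIPE. */
    if (send(cfd, "x", 1, MSG_NOSIGNAL) < 0)
        result = errno;
    else
        result = 0;
out:
    if (cfd >= 0)
        close(cfd);
    if (lfd >= 0)
        close(lfd);
    return result;
}
```

Calling errno_of_send_after_shutdown() from a small main() and comparing the result against EPIPE is enough to check the behaviour. This only illustrates the check quoted above; it does not reproduce the NFS hang itself.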

Here's a patch that fixes the hang. It clears the sk_shutdown flag at connection init time:

--- /home/company/software/src/linux-     2010-07-27 08:46:46.917000000 +0100
+++ net/ipv4/tcp_output.c       2010-07-27 09:19:16.000000000 +0100
@@ -2522,6 +2522,13 @@
        struct tcp_sock *tp = tcp_sk(sk);
        __u8 rcv_wscale;
+       /* clear down any previous shutdown attempts so that
+        * reconnects on a socket that's been shutdown leave the
+        * socket in a usable state (otherwise tcp_sendmsg() returns
+        * -EPIPE).
+        */
+       sk->sk_shutdown = 0;
        /* We'll fix this up when we get a response from the other end.
         * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT.

Whether that's the correct fix, I don't know.

At the time of writing, the current state of the thread in the LKML is here: http://lkml.org/lkml/2010/7/29/120.
Comment 1 Andrew Morton 2010-08-02 19:42:33 UTC
Please don't send patches via bugzilla - it causes lots of problems with
our usual patch management and review processes.

Please send this patch via email as per Documentation/SubmittingPatches. 
Suitable recipients may be found via scripts/get_maintainer.pl.  Please
also cc myself on the email.

For this one I'd suggest cc'ing netdev@vger.kernel.org and linux-nfs@vger.kernel.org at least.

Comment 2 andyc.bluearc 2010-08-03 08:45:26 UTC
patch submitted: http://lkml.org/lkml/2010/8/3/91
Comment 3 andyc.bluearc 2010-08-03 09:07:47 UTC
This problem also affects 2.6.32 and 2.6.30 series kernels. We have not seen such a hang with 2.6.26.
Comment 4 andyc.bluearc 2010-08-03 10:01:34 UTC
FWIW I've found it easier to reproduce this problem if Ethernet flow control is off, but it still happens with it on as well.

This is how I reproduce the problem.

If I run the following in 4 different xterm windows, each having cd'd to the same NFS-mounted directory:

xterm1: rm -rf *
xterm2: while true; do     let iter+=1;     echo $iter;     dd if=/dev/zero of=$$ bs=1M count=1000; done
xterm3: while true; do     let iter+=1;     echo $iter;     dd if=/dev/zero of=$$ bs=1M count=1000; done
xterm4: while true; do     let iter+=1;     echo $iter;     dd if=/dev/zero of=$$ bs=1M count=1000; done

then it normally hangs before the 3rd iteration starts. The directory contains a large amount of data (e.g. 5 Linux source trees).

This happens with different types of Ethernet hardware too. The rm -rf isn't necessary but makes the problem easier to reproduce (for me anyway).
Comment 5 andyc.bluearc 2010-08-05 11:18:26 UTC
Changing to regression as this cannot be reproduced on 2.6.26 series kernels.

I've reproduced the problem on
Comment 6 andyc.bluearc 2010-08-09 07:52:07 UTC
Created attachment 27384
Abort SUNRPC connection if it's still in shutdown state when it's reused.

The sk_shutdown flag was left set on a socket, causing tcp_sendmsg() to return an error, which in turn caused the RPC layer to repeat its recovery attempt. The patch detects this situation and aborts the connection if sk_shutdown is set when a connection is being reused.