Bug 10903

Summary: ssh connections hang with 2.6.26-rc5
Product: Networking Reporter: Didier Raboud (didier)
Component: Other Assignee: Arnaldo Carvalho de Melo (acme)
Status: CLOSED CODE_FIX    
Severity: normal CC: bunk
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.26-rc5 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 10492    

Description Didier Raboud 2008-06-13 02:39:16 UTC
Latest working kernel version: 2.6.25-2
Earliest failing kernel version: 2.6.26-rc5
Distribution: Debian (Lenny + Sid)
Hardware Environment: amd64 (Dell Latitude D630)
Software Environment: KDE
Problem Description:

With kernel version 2.6.26-rc5, ssh connections to remote servers hang at random, with no error message. Enabling "ServerAliveInterval" on both sides does not help.

Steps to reproduce:

Connect to a remote ssh server and do some stuff. After some time, the connection will hang.

Please ask for details.
Comment 1 Adrian Bunk 2008-06-13 02:56:31 UTC
A 2.6.26-rc networking regression?

Please reply via email.


Comment 2 Anonymous Emailer 2008-06-13 02:58:35 UTC
Reply-To: akpm@linux-foundation.org


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

Comment 3 Ilpo Järvinen 2008-06-14 13:45:51 UTC
On Fri, 13 Jun 2008, Andrew Morton wrote:

> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Fri, 13 Jun 2008 02:39:17 -0700 (PDT) bugme-daemon@bugzilla.kernel.org
> wrote:
> 
> > Problem Description:
> > 
> > With kernel version 2.6.26-rc5, the ssh connections to remote servers 
> > randomly 
> > hang (no error message). No amelioration despite the activation of
> > "ServerAliveInterval" on both sides.

Thanks for reporting. Could you please clarify a couple of things:

Does this only happen with a particular server/servers?
Any middleboxes in between (NAT, firewall, etc.)?
Do all ssh connections hang simultaneously?
How long have you waited until concluding that TCP is "hung"?
Is TSO enabled (ethtool -k)? Have you tried without it?
It wouldn't hurt to include info about the Ethernet hardware too (e.g., lspci);
it may turn out to be unnecessary, but it might save an email
round-trip.

TCP can appear to hang for a vast number of reasons. The only recent changes
that look suspect are the DEFERRED_ACCEPT thing, which is already
reverted in the very latest Linus' tree (even -rc6 is too old for that),
and a few FRTO fixes (you can exclude FRTO by setting the
/proc/sys/net/ipv4/tcp_frto sysctl to 0, but it seems quite unlikely to
change anything); your problem might well come from something else, with the
TCP hang being just a symptom of another problem downstream.

So please gather this information (at least for the relevant connections):

$ netstat -pn
$ cat /proc/net/tcp

...Also a tcpdump might be handy (though I don't know yet).

...Depending on your privacy needs, you may want to obfuscate the IP addresses
that are revealed by all of those logs (i.e., if you don't want to reveal
whom you're communicating with; the ssh payload is encrypted anyway).
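
When reading (or obfuscating) the /proc/net/tcp output requested above, note that the address columns are hexadecimal rather than dotted-quad. The following is a minimal sketch of decoding one such field, assuming the log was captured on a little-endian machine (such as the amd64 laptop in this report), where the IPv4 address bytes appear reversed. The function name is made up for this illustration.

```c
#include <stdio.h>

/* Decode one address field from /proc/net/tcp, e.g. "0100007F:0016".
 * Assumption: the file was produced on a little-endian host, so the
 * lowest byte of the parsed number is the first octet of the address.
 * The port is plain hexadecimal.
 * Returns 0 on success, -1 if the field is malformed.
 * decode_proc_net_tcp_addr is a hypothetical helper name. */
static int decode_proc_net_tcp_addr(const char *field, char *out, size_t outlen)
{
    unsigned int addr, port;

    if (sscanf(field, "%8X:%4X", &addr, &port) != 2)
        return -1;

    snprintf(out, outlen, "%u.%u.%u.%u:%u",
             addr & 0xFF, (addr >> 8) & 0xFF,
             (addr >> 16) & 0xFF, (addr >> 24) & 0xFF, port);
    return 0;
}
```

For example, passing "0100007F:0016" yields the loopback address with the ssh port, which makes it easier to pick out the relevant connection before masking addresses.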

> > Steps to reproduce:
> > 
> > Connect to a remote ssh server and do some stuff. After some time, the
> > connection will hang.
> > 
> > Please ask for details.

(I'll be away nearly a month after Tuesday, so I probably won't have much 
time to resolve this issue but I hope I've some time to take a look before 
I leave).
Comment 4 Didier Raboud 2008-06-15 06:37:33 UTC
On Saturday 14 June 2008 22:45:41, Ilpo J
Comment 5 Ilpo Järvinen 2008-06-16 06:22:03 UTC
On Sun, 15 Jun 2008, Didier Raboud wrote:

> On Saturday 14 June 2008 22:45:41, Ilpo J
Comment 6 Didier Raboud 2008-06-17 15:01:59 UTC
On Monday 16 June 2008 15:21:25, Ilpo Järvinen wrote:
> > > > > Problem Description:
> > > > >
> > > > > With kernel version 2.6.26-rc5, the ssh connections to remote
> > > > > servers randomly
> > > > > hang (no error message). No amelioration despite the activation of
> > > > > "ServerAliveInterval" on both sides.
> > > (...)
> > 
> > The common point is my use of "iwl3945" : I have always tried the ssh
> > connections through WiFi.
> > 
> > > So please gather this information (at least for the relevant
> > > connections):
> > >
> > > $ netstat -pn
> > > $ cat /proc/net/tcp

Hi,  

I used a script that logged both outputs every 15 seconds on the client and on
the server (inside "screen" on the server side). I then triggered a hang by
running "seq 1 100" several times in the ssh session. The logs are in the
attached debug_ssh_hang.tar.gz. The hang appeared between 234327 and 234342 on
the client side (so somewhere between 234324, 234339 and 234354 on the server
side).

I hope it'll help.

> You are probably running it under X, no? Please switch beforehand to some
> other vt (a textual one; Ctrl-Alt-Fn, where n < 6), then log in, run that
> command there and see if you get some output on the screen
> there. If you see something (e.g., a sudden OOPS message or some other
> warning printed) when it locks up, the easiest thing is to take a shot
> with a digicam (or write it down somewhere else) and send that shot (or
> those details) to us please.

I tried the following in vt1 (under the new 2.6.26-rc6 with kdm stopped):

# tcpdump -i wlan0 -w /tmp/tcpdump.wlan0

and I got the attached "soft lockup".

> ...Once you have a tcpdump, I can probably figure at least something out
> (though it might still just point to the right direction rather than
> exposing the actual cause).

I can't get one... :)

Regards, 

OdyX
Comment 7 Ilpo Järvinen 2008-06-17 16:05:08 UTC
On Tue, 17 Jun 2008, Didier Raboud wrote:

> On Monday 16 June 2008 15:21:25, Ilpo J
Comment 8 Johannes Berg 2008-06-18 00:25:38 UTC
Andrew Prince reported a similar problem and said he bisected it to
davem's 608961a5eca8d3c6bd07172febc27b5559408c5d ("mac80211: Use
skb_header_cloned() on TX path."), which made no sense to me, so I marked
the report as 'to investigate when I have more time'.

> > > You are probably running it under X, no? Please switch beforehand to some
> > > other vt (a textual one; Ctrl-Alt-Fn, where n < 6), then log in, run that
> > > command there and see if you get some output on the screen
> > > there. If you see something (e.g., a sudden OOPS message or some other
> > > warning printed) when it locks up, the easiest thing is to take a shot
> > > with a digicam (or write it down somewhere else) and send that shot (or
> > > those details) to us please.
> > 
> > I tried the following in vt1 (under the new 2.6.26-rc6 with kdm stopped):
> > 
> > # tcpdump -i wlan0 -w /tmp/tcpdump.wlan0
> > 
> > and I got the attached "soft lockup".

Attached where? I can't find it in the bug either.

johannes
Comment 9 David S. Miller 2008-06-18 01:05:59 UTC
From: Johannes Berg <johannes@sipsolutions.net>
Date: Wed, 18 Jun 2008 09:24:47 +0200

> 
> Andrew Prince reported a similar problem and said he bisected it  to
> davem's 608961a5eca8d3c6bd07172febc27b5559408c5d ("mac80211: Use
> skb_header_cloned() on TX path.") which made no sense to me so I marked
> the report as 'to investigate when I have more time'.

That's useful information.  The kernel bugzilla entry is for
the iwl3945 driver, so that matches up accurately with this
too.

If we can't figure out what's going on here soon (like, in less than a
day) we should revert that changeset.

Actually, I think I see how the changeset might be wrong.  I think
the encryption layer of mac80211 assumes it can write over the
data area of the SKB it's working on, not just the headers.

Once this happens, any retransmit of that SKB will fail because the
master packet data on TCP's retransmit queue is now encrypted
garbage.
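
The failure mode described above can be illustrated with a small userspace model. This is only a sketch, not kernel code: every type and function name below is hypothetical and merely mimics the difference between a "header shared?" test (in the spirit of skb_header_cloned()) and a "data shared?" test (in the spirit of skb_cloned()). The XOR loop stands in for software encryption that overwrites the payload in place.

```c
#include <stdlib.h>
#include <string.h>

/* Toy userspace model of the failure mode; NOT kernel code.
 * All names are hypothetical. */
struct toy_shared {
    unsigned char *payload; /* packet data, shared between clones */
    size_t len;
    int data_refs;          /* buffers pointing at this payload */
    int header_refs;        /* of those, how many still care about headers */
};

struct toy_skb {
    struct toy_shared *sh;
};

static void *toy_alloc(size_t n)
{
    void *p = malloc(n ? n : 1);
    if (!p)
        abort();
    return p;
}

/* TCP keeps the original on its retransmit queue and only cares about
 * the payload: one data reference, no header reference. */
static struct toy_skb toy_tcp_queue(const char *msg)
{
    struct toy_skb skb;

    skb.sh = toy_alloc(sizeof(*skb.sh));
    skb.sh->len = strlen(msg);
    skb.sh->payload = toy_alloc(skb.sh->len);
    memcpy(skb.sh->payload, msg, skb.sh->len);
    skb.sh->data_refs = 1;
    skb.sh->header_refs = 0;
    return skb;
}

/* The clone handed down for transmission shares the payload bytes. */
static struct toy_skb toy_clone_for_tx(struct toy_skb *orig)
{
    struct toy_skb clone = { orig->sh };

    orig->sh->data_refs++;
    orig->sh->header_refs++;
    return clone;
}

static int toy_cloned(const struct toy_skb *skb)
{
    return skb->sh->data_refs > 1;
}

static int toy_header_cloned(const struct toy_skb *skb)
{
    return skb->sh->header_refs > 1;
}

/* Give skb a private copy of the payload before modifying it. */
static void toy_unshare(struct toy_skb *skb)
{
    struct toy_shared *old = skb->sh;
    struct toy_shared *priv = toy_alloc(sizeof(*priv));

    priv->len = old->len;
    priv->payload = toy_alloc(old->len);
    memcpy(priv->payload, old->payload, old->len);
    priv->data_refs = 1;
    priv->header_refs = 1;
    old->data_refs--;
    old->header_refs--;
    skb->sh = priv;
}

/* Stand-in for software encryption: overwrites the payload in place. */
static void toy_encrypt_in_place(struct toy_skb *skb)
{
    for (size_t i = 0; i < skb->sh->len; i++)
        skb->sh->payload[i] ^= 0x5A;
}

/* full_check != 0 mimics a data-sharing test before writing;
 * full_check == 0 mimics a header-only test. */
static void toy_xmit(struct toy_skb *skb, int full_check)
{
    int shared = full_check ? toy_cloned(skb) : toy_header_cloned(skb);

    if (shared)
        toy_unshare(skb);
    toy_encrypt_in_place(skb);
}
```

In this model, calling toy_xmit() with the header-only test on a clone of a queued buffer scrambles the queued copy as well, so a later retransmit would send garbage. With the full test, the clone is copied first and the queued data stays intact. Freeing the buffers is omitted to keep the sketch short.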
Comment 10 David S. Miller 2008-06-18 01:27:00 UTC
From: David Miller <davem@davemloft.net>
Date: Wed, 18 Jun 2008 01:05:28 -0700 (PDT)

> If we can't figure out what's going on here soon (like, in less than a
> day) we should revert that changeset.
> 
> Actually, I think I see how the changeset might be wrong.  I think
> the encryption layer of mac80211 assumes it can write over the
> data area of the SKB it's working on, not just the headers.
> 
> Once this happens, any retransmits done by SKB will fail because the
> master packet data on TCP's retransmit queue is now this encrypted
> garbage.

After some discussion about this with Johannes on IRC, we are
absolutely convinced this is exactly the problem.

I intend to send the following revert to Linus tonight so we
can close this:

--------------------
Revert "mac80211: Use skb_header_cloned() on TX path."

This reverts commit 608961a5eca8d3c6bd07172febc27b5559408c5d.

The problem is that the mac80211 stack not only needs to be able to
muck with the link-level headers, it also might need to mangle all of
the packet data if doing sw wireless encryption.

This fixes kernel bugzilla #10903.  Thanks to Didier Raboud (for the
bugzilla report), Andrew Prince (for bisecting), Johannes Berg (for
bringing this bisection analysis to my attention), and Ilpo (for
trying to analyze this purely from the TCP side).

In 2.6.27 we can take another stab at this, by using something like
skb_cow_data() when the TX path of mac80211 ends up with a non-NULL
tx->key.  The ESP protocol code in the IPSEC stack can be used as a
model for implementation.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/mac80211/tx.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 1d7dd54..28d8bd5 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -1562,13 +1562,13 @@ int ieee80211_subif_start_xmit(struct sk_buff *skb,
 	 * be cloned. This could happen, e.g., with Linux bridge code passing
 	 * us broadcast frames. */
 
-	if (head_need > 0 || skb_header_cloned(skb)) {
+	if (head_need > 0 || skb_cloned(skb)) {
 #if 0
 		printk(KERN_DEBUG "%s: need to reallocate buffer for %d bytes "
 		       "of headroom\n", dev->name, head_need);
 #endif
 
-		if (skb_header_cloned(skb))
+		if (skb_cloned(skb))
 			I802_DEBUG_INC(local->tx_expand_skb_head_cloned);
 		else
 			I802_DEBUG_INC(local->tx_expand_skb_head);
Comment 11 Didier Raboud 2008-06-18 04:34:23 UTC
On Wednesday 18 June 2008 09:24:47, Johannes Berg wrote:
> Andrew Prince reported a similar problem and said he bisected it  to
> davem's 608961a5eca8d3c6bd07172febc27b5559408c5d ("mac80211: Use
> skb_header_cloned() on TX path.") which made no sense to me so I marked
> the report as 'to investigate when I have more time'.
>
> > > and I got the attached "soft lockup".
>
> Attached where? I can't find it in the bug either.
>
> johannes

Hi,

For unknown reasons, my mail doesn't appear in the mailing list archives, and
the large attachment has apparently been stripped.

You'll find the "soft lockup" here:

http://raboud.homelinux.org/~didier/kernel/tcpdump_oops.jpg

Regards, 

Didier
Comment 12 Michael Buesch 2008-06-18 04:39:53 UTC
On Wednesday 18 June 2008 13:34:06 Didier Raboud wrote:
> On Wednesday 18 June 2008 09:24:47, Johannes Berg wrote:
> > Andrew Prince reported a similar problem and said he bisected it  to
> > davem's 608961a5eca8d3c6bd07172febc27b5559408c5d ("mac80211: Use
> > skb_header_cloned() on TX path.") which made no sense to me so I marked
> > the report as 'to investigate when I have more time'.
> >
> > > > and I got the attached "soft lockup".
> >
> > Attached where? I can't find it in the bug either.
> >
> > johannes
> 
> Hi, 
> 
> for unknown reasons, my mail doesn't appear in the mailing list archives and 
> the big attachment has visibly been stripped.
> 
> You'll find the "soft lockup" there:
> 
> http://raboud.homelinux.org/~didier/kernel/tcpdump_oops.jpg

The soft lockup is most likely a follow-up oops to a previous one that
locked up the machine.
Can you try to reproduce it and capture the screen before the 61 seconds
it takes for the watchdog to trigger have passed? There should be another oops
before that (you can see the last two lines of it in this picture).
Comment 13 Adrian Bunk 2008-06-19 01:29:19 UTC
Fixed by commit 3a5be7d4b079f3a9ce1e8ce4a93ba15ae6d00111