Bug 10903
Summary: | ssh connections hang with 2.6.26-rc5 | ||
---|---|---|---|
Product: | Networking | Reporter: | Didier Raboud (didier) |
Component: | Other | Assignee: | Arnaldo Carvalho de Melo (acme) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | bunk |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.26-rc5 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 10492 |
Description
Didier Raboud
2008-06-13 02:39:16 UTC
A 2.6.26-rc networking regression?

Please reply via email.

----- Forwarded message from bugme-daemon@bugzilla.kernel.org -----

Date: Fri, 13 Jun 2008 02:39:17 -0700 (PDT)
From: bugme-daemon@bugzilla.kernel.org
To: bunk@kernel.org
Subject: [Bug 10903] New: ssh connections hang with 2.6.26-rc5

http://bugzilla.kernel.org/show_bug.cgi?id=10903

Summary: ssh connections hang with 2.6.26-rc5
Product: Networking
Version: 2.5
KernelVersion: 2.6.26-rc5
Platform: All
OS/Version: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Other
AssignedTo: acme@ghostprotocols.net
ReportedBy: didier@raboud.com

Latest working kernel version: 2.6.25-2
Earliest failing kernel version: 2.6.26-rc5
Distribution: Debian (Lenny + Sid)
Hardware Environment: amd64 (Dell Latitude D630)
Software Environment: KDE
Problem Description:

With kernel version 2.6.26-rc5, the ssh connections to remote servers randomly hang (no error message). No amelioration despite the activation of "ServerAliveInterval" on both sides.

Steps to reproduce:

Connect to a remote ssh server and do some stuff. After some time, the connection will hang.

Please ask for details.

------- You are receiving this mail because: -------

Reply-To: akpm@linux-foundation.org

(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface).
On Fri, 13 Jun 2008 02:39:17 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=10903
>
> Summary: ssh connections hang with 2.6.26-rc5
> Product: Networking
> Version: 2.5
> KernelVersion: 2.6.26-rc5
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: normal
> Priority: P1
> Component: Other
> AssignedTo: acme@ghostprotocols.net
> ReportedBy: didier@raboud.com
>
> Latest working kernel version: 2.6.25-2
> Earliest failing kernel version: 2.6.26-rc5
> Distribution: Debian (Lenny + Sid)
> Hardware Environment: amd64 (Dell Latitude D630)
> Software Environment: KDE
> Problem Description:
>
> With kernel version 2.6.26-rc5, the ssh connections to remote servers
> randomly hang (no error message). No amelioration despite the activation of
> "ServerAliveInterval" on both sides.
>
> Steps to reproduce:
>
> Connect to a remote ssh server and do some stuff. After some time, the
> connection will hang.
>
> Please ask for details.

On Fri, 13 Jun 2008, Andrew Morton wrote:

> (switched to email. Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Fri, 13 Jun 2008 02:39:17 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:
>
> > http://bugzilla.kernel.org/show_bug.cgi?id=10903
> >
> > Summary: ssh connections hang with 2.6.26-rc5
> > Product: Networking
> > Version: 2.5
> > KernelVersion: 2.6.26-rc5
> > Platform: All
> > OS/Version: Linux
> > Tree: Mainline
> > Status: NEW
> > Severity: normal
> > Priority: P1
> > Component: Other
> > AssignedTo: acme@ghostprotocols.net
> > ReportedBy: didier@raboud.com
> >
> > Latest working kernel version: 2.6.25-2
> > Earliest failing kernel version: 2.6.26-rc5
> > Distribution: Debian (Lenny + Sid)
> > Hardware Environment: amd64 (Dell Latitude D630)
> > Software Environment: KDE
> > Problem Description:
> >
> > With kernel version 2.6.26-rc5, the ssh connections to remote servers
> > randomly hang (no error message). No amelioration despite the activation of
> > "ServerAliveInterval" on both sides.

Thanks for reporting. Could you please clarify a couple of things:

Does this only happen with a particular server or servers?
Are there any middleboxes in between (NAT, firewall, etc.)?
Do all ssh connections hang simultaneously?
How long did you wait before concluding that TCP is "hung"?
Is TSO enabled (ethtool -k)? Have you tried without it?

It wouldn't hurt to include information about the Ethernet hardware too (e.g., lspci output); it may turn out to be unneeded, but it might save an email round-trip.

TCP can appear to hang for a vast number of reasons. The only recent changes that look suspect are the DEFERRED_ACCEPT change, which is already reverted in the very latest Linus tree (even -rc6 is too old for that), and a few FRTO fixes (you can exclude FRTO by setting the /proc/sys/net/ipv4/tcp_frto sysctl to 0, but it seems quite unlikely to change anything). Your problem might well come from something else, with the TCP hang being just a symptom of another problem downstream.

So please gather this information (at least for the relevant connections):

$ netstat -pn
$ cat /proc/net/tcp

...Also a tcpdump might be handy (though I don't know yet).

...Depending on your privacy needs, you may want to obfuscate the IP addresses that are revealed by all of those logs (i.e., if you don't want to reveal whom you're communicating with; the ssh payload is encrypted anyway).

> > Steps to reproduce:
> >
> > Connect to a remote ssh server and do some stuff. After some time, the
> > connection will hang.
> >
> > Please ask for details.

(I'll be away for nearly a month after Tuesday, so I probably won't have much time to resolve this issue, but I hope I'll have some time to take a look before I leave.)

On Saturday 14 June 2008 22:45:41, Ilpo J

On Sun, 15 Jun 2008, Didier Raboud wrote:
> On Saturday 14 June 2008 22:45:41, Ilpo J
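
The `cat /proc/net/tcp` output requested earlier in this thread is hexadecimal and awkward to read by eye. The following is a minimal sketch, not something posted by any participant, of how such a log could be decoded to spot connections that have unacknowledged data and a non-zero retransmit counter. The function names and the sample rows are hypothetical, and the address decoding assumes the log was taken on a little-endian host such as the reporter's amd64 machine.

```python
import socket
import struct

# State codes as printed in the "st" column of /proc/net/tcp.
TCP_STATES = {
    0x01: "ESTABLISHED", 0x02: "SYN_SENT", 0x03: "SYN_RECV",
    0x04: "FIN_WAIT1", 0x05: "FIN_WAIT2", 0x06: "TIME_WAIT",
    0x07: "CLOSE", 0x08: "CLOSE_WAIT", 0x09: "LAST_ACK",
    0x0A: "LISTEN", 0x0B: "CLOSING",
}


def decode_endpoint(field):
    """Turn 'HEXADDR:HEXPORT' into ('a.b.c.d', port).

    Assumption: the log comes from a little-endian host, so the
    address bytes appear reversed in the hex string.
    """
    addr_hex, port_hex = field.split(":")
    addr = socket.inet_ntoa(struct.pack("<I", int(addr_hex, 16)))
    return addr, int(port_hex, 16)


def parse_tcp_line(line):
    """Decode one data row of /proc/net/tcp into a dict."""
    cols = line.split()
    tx_queue, rx_queue = (int(v, 16) for v in cols[4].split(":"))
    return {
        "local": decode_endpoint(cols[1]),
        "remote": decode_endpoint(cols[2]),
        "state": TCP_STATES.get(int(cols[3], 16), "UNKNOWN"),
        "tx_queue": tx_queue,
        "rx_queue": rx_queue,
        "retransmits": int(cols[6], 16),  # the "retrnsmt" column
    }


def stuck_candidates(lines):
    """Established sockets with queued data and a non-zero retransmit counter."""
    rows = (
        parse_tcp_line(l)
        for l in lines
        if l.strip() and not l.lstrip().startswith("sl")  # skip the header row
    )
    return [
        r for r in rows
        if r["state"] == "ESTABLISHED" and r["tx_queue"] > 0 and r["retransmits"] > 0
    ]


# Made-up sample rows for illustration only; not taken from the reporter's logs.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 4242
   1: 0100007F:C350 0200007F:0016 01 00000040:00000000 01:00000014 00000003  1000        0 4243
"""

if __name__ == "__main__":
    for row in stuck_candidates(SAMPLE.splitlines()):
        print(row)
```

Running the same filter over snapshots taken every few seconds would show whether the send queue of the ssh connection stays non-empty while the retransmit counter keeps growing, which is the pattern one would expect if retransmitted segments never get acknowledged.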
On Monday 16 June 2008 15:21:25, Ilpo Järvinen wrote:

> > > > > http://bugzilla.kernel.org/show_bug.cgi?id=10903
> > > > >
> > > > > Summary: ssh connections hang with 2.6.26-rc5
> > > > > Product: Networking
> > > > > Version: 2.5
> > > > > KernelVersion: 2.6.26-rc5
> > > > > Platform: All
> > > > > OS/Version: Linux
> > > > > Tree: Mainline
> > > > > Status: NEW
> > > > > Severity: normal
> > > > > Priority: P1
> > > > > Component: Other
> > > > > AssignedTo: acme@ghostprotocols.net
> > > > > ReportedBy: didier@raboud.com
> > > > >
> > > > > Latest working kernel version: 2.6.25-2
> > > > > Earliest failing kernel version: 2.6.26-rc5
> > > > > Distribution: Debian (Lenny + Sid)
> > > > > Hardware Environment: amd64 (Dell Latitude D630)
> > > > > Software Environment: KDE
> > > > > Problem Description:
> > > > >
> > > > > With kernel version 2.6.26-rc5, the ssh connections to remote
> > > > > servers randomly hang (no error message). No amelioration despite
> > > > > the activation of "ServerAliveInterval" on both sides.
> > > (...)
> >
> > The common point is my use of "iwl3945" : I have always tried the ssh
> > connections through WiFi.
> >
> > > So please gather this information (at least for the relevant
> > > connections):
> > >
> > > $ netstat -pn
> > > $ cat /proc/net/tcp

Hi,

I used a script which logged both every 15 seconds on both sides (in "screen" on the server side). I then triggered a hang with several "seq 1 100" in the ssh session. The logs are in the attached debug_ssh_hang.tar.gz.

The hang appeared between 234327 and 234342 on the client side (so somewhere between 234324, 234339 and 234354 on the server side).

I hope it'll help.

> You probably run it under X, no? Please switch beforehand to some other vt
> (a textual one) then (Ctrl-Alt-Fn, where n < 6) and then log in and
> running that command there and see if you get some output into screen
> there. If you see something (e.g., a sudden OOPS message or some other
> warning printed) when it locks up, the easiest things is to take a shot
> with a digicam (or write it down somewhere else) and send that shot (or
> those details) to us please.

I tried the following in vt1 (under the new 2.6.26-rc6 with kdm stopped):

# tcpdump -i wlan0 -w /tmp/tcpdump.wlan0

and I got the attached "soft lockup".

> ...Once you have a tcpdump, I can probably figure at least something out
> (though it might still just point to the right direction rather than
> exposing the actual cause).

I can't get one... :)

Regards,

OdyX

On Tue, 17 Jun 2008, Didier Raboud wrote:
> On Monday 16 June 2008 15:21:25, Ilpo J
> > > > > > > http://bugzilla.kernel.org/show_bug.cgi?id=10903
> > > > > > >
> > > > > > > Summary: ssh connections hang with 2.6.26-rc5

Andrew Prince reported a similar problem and said he bisected it to davem's 608961a5eca8d3c6bd07172febc27b5559408c5d ("mac80211: Use skb_header_cloned() on TX path.") which made no sense to me so I marked the report as 'to investigate when I have more time'.

> > > You probably run it under X, no? Please switch beforehand to some other vt
> > > (a textual one) then (Ctrl-Alt-Fn, where n < 6) and then log in and
> > > running that command there and see if you get some output into screen
> > > there. If you see something (e.g., a sudden OOPS message or some other
> > > warning printed) when it locks up, the easiest things is to take a shot
> > > with a digicam (or write it down somewhere else) and send that shot (or
> > > those details) to us please.
> >
> > I tried the following in vt1 (under the new 2.6.26-rc6 with kdm stopped):
> >
> > # tcpdump -i wlan0 -w /tmp/tcpdump.wlan0
> >
> > and I got the attached "soft lockup".

Attached where? I can't find it in the bug either.

johannes

From: Johannes Berg <johannes@sipsolutions.net>
Date: Wed, 18 Jun 2008 09:24:47 +0200

> > > > > > > > http://bugzilla.kernel.org/show_bug.cgi?id=10903
> > > > > > > >
> > > > > > > > Summary: ssh connections hang with 2.6.26-rc5
>
> Andrew Prince reported a similar problem and said he bisected it to
> davem's 608961a5eca8d3c6bd07172febc27b5559408c5d ("mac80211: Use
> skb_header_cloned() on TX path.") which made no sense to me so I marked
> the report as 'to investigate when I have more time'.

That's useful information. The kernel bugzilla entry is for the iwl3945 driver, so that matches up accurately with this too.

If we can't figure out what's going on here soon (like, in less than a day) we should revert that changeset.

Actually, I think I see how the changeset might be wrong. I think the encryption layer of mac80211 assumes it can write over the data area of the SKB it's working on, not just the headers.

Once this happens, any retransmits done by SKB will fail because the master packet data on TCP's retransmit queue is now this encrypted garbage.

From: David Miller <davem@davemloft.net>
Date: Wed, 18 Jun 2008 01:05:28 -0700 (PDT)

> If we can't figure out what's going on here soon (like, in less than a
> day) we should revert that changeset.
>
> Actually, I think I see how the changeset might be wrong. I think
> the encryption layer of mac80211 assumes it can write over the
> data area of the SKB it's working on, not just the headers.
>
> Once this happens, any retransmits done by SKB will fail because the
> master packet data on TCP's retransmit queue is now this encrypted
> garbage.

After some discussion about this with Johannes on IRC, we are absolutely convinced this is exactly the problem.

I intend to send the following revert to Linus tonight so we can close this:

--------------------
Revert "mac80211: Use skb_header_cloned() on TX path."

This reverts commit 608961a5eca8d3c6bd07172febc27b5559408c5d.

The problem is that the mac80211 stack not only needs to be able to muck with the link-level headers, it also might need to mangle all of the packet data if doing sw wireless encryption.

This fixes kernel bugzilla #10903.

Thanks to Didier Raboud (for the bugzilla report), Andrew Prince (for bisecting), Johannes Berg (for bringing this bisection analysis to my attention), and Ilpo (for trying to analyze this purely from the TCP side).

In 2.6.27 we can take another stab at this, by using something like skb_cow_data() when the TX path of mac80211 ends up with a non-NULL tx->key. The ESP protocol code in the IPSEC stack can be used as a model for implementation.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/mac80211/tx.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 1d7dd54..28d8bd5 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -1562,13 +1562,13 @@ int ieee80211_subif_start_xmit(struct sk_buff *skb,
 	 * be cloned. This could happen, e.g., with Linux bridge code passing
 	 * us broadcast frames. */
 
-	if (head_need > 0 || skb_header_cloned(skb)) {
+	if (head_need > 0 || skb_cloned(skb)) {
 #if 0
 		printk(KERN_DEBUG "%s: need to reallocate buffer for %d bytes "
 		       "of headroom\n", dev->name, head_need);
 #endif
 
-		if (skb_header_cloned(skb))
+		if (skb_cloned(skb))
 			I802_DEBUG_INC(local->tx_expand_skb_head_cloned);
 		else
 			I802_DEBUG_INC(local->tx_expand_skb_head);

On Wednesday 18 June 2008 09:24:47, Johannes Berg wrote:

> > > > > > > > http://bugzilla.kernel.org/show_bug.cgi?id=10903
> > > > > > > >
> > > > > > > > Summary: ssh connections hang with 2.6.26-rc5
>
> Andrew Prince reported a similar problem and said he bisected it to
> davem's 608961a5eca8d3c6bd07172febc27b5559408c5d ("mac80211: Use
> skb_header_cloned() on TX path.") which made no sense to me so I marked
> the report as 'to investigate when I have more time'.
>
> > > > and I got the attached "soft lockup".
>
> Attached where? I can't find it in the bug either.
>
> johannes

Hi,

for unknown reasons, my mail doesn't appear in the mailing list archives and the big attachment has visibly been stripped.

You'll find the "soft lockup" there:

http://raboud.homelinux.org/~didier/kernel/tcpdump_oops.jpg

Regards,

Didier

On Wednesday 18 June 2008 13:34:06 Didier Raboud wrote:
> On Wednesday 18 June 2008 09:24:47, Johannes Berg wrote:
> > > > > > > > > http://bugzilla.kernel.org/show_bug.cgi?id=10903
> > > > > > > > >
> > > > > > > > > Summary: ssh connections hang with 2.6.26-rc5
> >
> > Andrew Prince reported a similar problem and said he bisected it to
> > davem's 608961a5eca8d3c6bd07172febc27b5559408c5d ("mac80211: Use
> > skb_header_cloned() on TX path.") which made no sense to me so I marked
> > the report as 'to investigate when I have more time'.
> >
> > > > and I got the attached "soft lockup".
> >
> > Attached where? I can't find it in the bug either.
> >
> > johannes
>
> Hi,
>
> for unknown reasons, my mail doesn't appear in the mailing list archives and
> the big attachment has visibly been stripped.
>
> You'll find the "soft lockup" there:
>
> http://raboud.homelinux.org/~didier/kernel/tcpdump_oops.jpg
The soft lockup most likely is a followup-oops to a previous one that
locked up the machine.
Can you try to reproduce it and capture the screen before the watchdog
triggers after 61 seconds? There should be another oops before that (you can
see the last two lines of it in this picture).
Fixed by commit 3a5be7d4b079f3a9ce1e8ce4a93ba15ae6d00111.
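
David Miller's analysis in this thread is that the wireless TX path encrypted the packet data in place while TCP's retransmit queue still referenced the same data. The toy model below is only an illustration of that failure mode, not kernel code. `SharedData`, `Packet`, `toy_encrypt_in_place`, and `transmit` are hypothetical names, the XOR "cipher" is a placeholder, and the `check_full_clone` flag stands in loosely for the difference between testing `skb_cloned()` and `skb_header_cloned()`.

```python
class SharedData:
    """Stand-in for the packet data area that clones share."""

    def __init__(self, payload):
        self.buf = bytearray(payload)
        self.users = 1  # number of Packet objects referencing this buffer


class Packet:
    """Stand-in for a packet descriptor that points at shared data."""

    def __init__(self, data):
        self.data = data

    def clone(self):
        # A clone gets its own descriptor but shares the data area.
        self.data.users += 1
        return Packet(self.data)

    def is_cloned(self):
        return self.data.users > 1

    def unshare(self):
        # Copy the data area if somebody else still references it.
        if self.is_cloned():
            self.data.users -= 1
            self.data = SharedData(bytes(self.data.buf))


def toy_encrypt_in_place(pkt, key=0x5A):
    """Placeholder for software encryption that overwrites the payload."""
    for i, b in enumerate(pkt.data.buf):
        pkt.data.buf[i] = b ^ key


def transmit(pkt, check_full_clone):
    """Encrypt and 'send' a packet.

    check_full_clone=True models the reverted behaviour: copy the data
    before writing whenever it is shared at all.
    check_full_clone=False models a check that looks only at header
    sharing and therefore skips the copy.
    """
    if check_full_clone:
        pkt.unshare()
    toy_encrypt_in_place(pkt)
    return bytes(pkt.data.buf)


if __name__ == "__main__":
    queued = Packet(SharedData(b"ssh payload"))  # kept for retransmission
    transmit(queued.clone(), check_full_clone=False)
    print("header-only check:", bytes(queued.data.buf))

    queued = Packet(SharedData(b"ssh payload"))
    transmit(queued.clone(), check_full_clone=True)
    print("full clone check: ", bytes(queued.data.buf))
```

In the first run the queued copy is overwritten by the encryption step, so any later retransmission would send already-encrypted bytes through encryption again. In the second run the queued copy keeps its original contents because the transmitted clone was given a private buffer first.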