Bug 27742

Summary: PPP over SSH tunnel triggers OOPS
Product: Networking Reporter: Kris Karas (bugs-a21)
Component: Other Assignee: Arnaldo Carvalho de Melo (acme)
Status: RESOLVED UNREPRODUCIBLE    
Severity: normal CC: alan
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.30 ? Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Hand-copied OOPS from 2.6.37 kernel

Description Kris Karas 2011-01-28 21:58:45 UTC
Created attachment 45412
Hand-copied OOPS from 2.6.37 kernel

When a VPN connection is created using PPP tunneled over SSH, the kernel will OOPS when certain traffic patterns are encountered (see the attached OOPS).

I first created such a VPN connection using kernel 2.6.33, which is affected.  Kernel 2.6.27 is not affected.  I have not attempted to binary-search for the exact commit, but am merely guessing it is in kernel 2.6.30 (as a variety of ppp-related commits appear in the changelog there).

The VPN tunnel is established by invoking 'pppd' with the 'pty' parameter set to invoke "ssh remotehost.com pppd", which establishes an SSH tunnel over IPv4 to the remote host and then invokes the remote pppd to handle the other end of the point-to-point VPN.

Reproducing this bug is not easy.  With the ppp-ssh-ppp tunnel open, I have tried triggering the OOPS by sending PINGs, rsync-ing files in both directions, and opening interactive SSH connections.  Only one thing seems to trigger the OOPS: running Mozilla Thunderbird on the remote end, which opens several IMAP connections over the tunnel simultaneously.  Typically, the OOPS will occur within 1 or 2 seconds of invoking Thunderbird.

When the OOPS occurs, usually the console will be scrolling wildly with OOPS after OOPS, making copying impossible.  It has taken me two months of repeated tries to get one OOPS that remained on-screen and could be copied.  The kernel is in a hard-run state when the OOPS occurs; nothing gets logged to syslog, the keyboard is unresponsive (magic sysrq key does nothing).
Comment 1 Andrew Morton 2011-01-28 22:34:03 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Fri, 28 Jan 2011 21:58:49 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=27742
> 
>            Summary: PPP over SSH tunnel triggers OOPS
>            Product: Networking
>            Version: 2.5
>     Kernel Version: 2.6.30 ?
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>         AssignedTo: acme@ghostprotocols.net
>         ReportedBy: ktk@bigfoot.com
>         Regression: Yes
> 
> 
> Created an attachment (id=45412)
>  --> (https://bugzilla.kernel.org/attachment.cgi?id=45412)
> Hand-copied OOPS from 2.6.37 kernel
> 

> skb_over_panic: text:c12a354f len:847 put:847 head:f57e8c00 data:f57e8c00
> tail:0xf57e8f4f end:0xf57e8e80 dev:<NULL>
> kernel BUG at net/core/skbuff.c:127!
> invalid opcode: 0000 [#1] SMP
> last sysfs file: /sys/devices/virtual/net/ppp0/flags
> Modules linked in:
> 
> Pid: 0, comm: swapper Not tainted 2.6.37 #1 0KH290/OptiPlex GX620
> EIP: 0060:[<c1330110>] EFLAGS: 00010282 CPU: 0
> EIP is at skb_put+0x82/0x84
> EAX: 00000089 EBX: f57e8f4f ECX: c151579c EDX: 00000046
> ESI: 00000000 EDI: c1530760 EBP: f67bb384 ESP: f6409d50
>  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> Process swapper (pid: 0, ti=f6408000 task=c15114a0 task.ti=c1502000)
> Stack:
>  c14d4590 c12a354f 0000034f 0000034f f57e8c00 f57e8c00 f57e8f4f f57e8e80
>  c14d2509 f67bb380 f454db80 c12a354f 000005e0 00000244 f5f944c2 0000034f
>  c1390a0a f67bb3d4 f4648380 f67bb394 f67bb3a4 00000202 f454db80 f67bb000
> Call Trace:
>  [<c12a354f>] ? ppp_xmit_process+0x45a/0x4e6
>  [<c12a354f>] ? ppp_xmit_process+0x45a/0x4e6
>  [<c1390a0a>] ? tcp_manip_pkt+0xad/0xcb
>  [<c12a36d4>] ? ppp_start_xmit+0xf9/0x175
>  [<c133a496>] ? dev_hard_start_xmit+0x2a4/0x5c3
>  [<c1347cad>] ? sch_direct_xmit+0xb9/0x184
>  [<c134c663>] ? nf_iterate+0x52/0x76
>  [<c1362d56>] ? ip_finish_output+0x0/0x294
>  [<c133a88e>] ? dev_queue_xmit+0xd9/0x3b0
>  [<c1362d56>] ? ip_finish_output+0x0/0x294
>  [<c1362f32>] ? ip_finish_output+0x1dc/0x294
>  [<c1362d56>] ? ip_finish_output+0x0/0x294
>  [<c1360d66>] ? ip_forward_finish+0x36/0x42
>  [<c135f8a4>] ? ip_rcv_finish+0x42/0x323
>  [<c13384ac>] ? __netif_receive_skb+0x225/0x299
>  [<c1048b56>] ? getnstimeofday+0x42/0xe8
>  [<c13386ab>] ? netif_receive_skb+0x41/0x64
>  [<c1339356>] ? dev_gro_receive+0x146/0x1dd
>  [<c133955e>] ? napi_gro_receive+0xa5/0xb3
>  [<c129ae2f>] ? tg3_poll_work+0x5df/0xaca
>  [<c1007340>] ? nommu_sync_single_for_device+0x0/0x1
>  [<c129b3e2>] ? tg3_poll+0x43/0x19a
>  [<c133893b>] ? net_rx_action+0x6c/0xf4
>  [<c1031eb5>] ? __do_softirq+0x77/0xf0
>  [<c1031e3e>] ? __do_softirq+0x0/0xf0
>  <IRQ>
>  [<c1031fe6>] ? irq_exit+0x5d/0x5f
>
Comment 2 David S. Miller 2011-01-28 22:55:28 UTC
From: Andrew Morton <akpm@linux-foundation.org>
Date: Fri, 28 Jan 2011 14:32:38 -0800

>> skb_over_panic: text:c12a354f len:847 put:847 head:f57e8c00 data:f57e8c00
>> tail:0xf57e8f4f end:0xf57e8e80 dev:<NULL>
>> kernel BUG at net/core/skbuff.c:127!
...
>> Pid: 0, comm: swapper Not tainted 2.6.37 #1 0KH290/OptiPlex GX620
>> EIP: 0060:[<c1330110>] EFLAGS: 00010282 CPU: 0
>> EIP is at skb_put+0x82/0x84
...
>> Call Trace:
>>  [<c12a354f>] ? ppp_xmit_process+0x45a/0x4e6
>>  [<c12a354f>] ? ppp_xmit_process+0x45a/0x4e6
>>  [<c1390a0a>] ? tcp_manip_pkt+0xad/0xcb
>>  [<c12a36d4>] ? ppp_start_xmit+0xf9/0x175

I took a quick look at this, and I surmise that we have a packet we
are trying to compress (that's the only way I can see in the
ppp_xmit_process() code paths to get an skb_put() call so large).

And we can see from the skb_over_panic message that we have an SKB
which was allocated with 640 bytes of space, but we are trying to
"put" 847 bytes into it which is too large and overflows.
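To make that arithmetic explicit, here is a minimal C sketch that recomputes the figures from the head, data, tail and end values printed in the skb_over_panic line. The struct and function names (skb_report, skb_report_capacity and so on) are hypothetical and exist only for this illustration; they are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Addresses as printed by skb_over_panic on a 32-bit kernel. */
struct skb_report {
	uint32_t head;	/* start of the allocated buffer */
	uint32_t data;	/* start of the packet data */
	uint32_t tail;	/* end of the packet data after the attempted put */
	uint32_t end;	/* end of the allocated buffer */
};

/* Values copied from the OOPS in this report. */
static const struct skb_report oops_2_6_37 = {
	.head = 0xf57e8c00u,
	.data = 0xf57e8c00u,
	.tail = 0xf57e8f4fu,
	.end  = 0xf57e8e80u,
};

/* Bytes available between head and end. */
static uint32_t skb_report_capacity(const struct skb_report *r)
{
	assert(r->end >= r->head);
	return r->end - r->head;
}

/* Bytes the packet would occupy between data and tail. */
static uint32_t skb_report_used(const struct skb_report *r)
{
	assert(r->tail >= r->data);
	return r->tail - r->data;
}

/* Bytes by which tail ran past end, or 0 if the data still fits. */
static uint32_t skb_report_overrun(const struct skb_report *r)
{
	return r->tail > r->end ? r->tail - r->end : 0;
}

/* Print all three figures on one line. */
static void skb_report_print(const struct skb_report *r)
{
	printf("capacity=%u used=%u overrun=%u\n",
	       (unsigned)skb_report_capacity(r),
	       (unsigned)skb_report_used(r),
	       (unsigned)skb_report_overrun(r));
}
```

Calling skb_report_print(&oops_2_6_37) prints capacity=640 used=847 overrun=207, which matches the 640-byte buffer and the 847-byte put described above.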

Can you run with the following debugging patch and see what it prints
out when this happens?

diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c
index 9f6d670..06c6ea7 100644
--- a/drivers/net/ppp_generic.c
+++ b/drivers/net/ppp_generic.c
@@ -1093,6 +1093,15 @@ pad_compress_skb(struct ppp *ppp, struct sk_buff *skb)
 	if (len > 0 && (ppp->flags & SC_CCP_UP)) {
 		kfree_skb(skb);
 		skb = new_skb;
+#if 1
+		if (len > (skb->end - skb->tail)) {
+			printk(KERN_ERR "pad_compress_skb: Compression overflow ["
+			       "new_skb_size(%d) compressor_skb_size(%d) "
+			       "hard_header_len(%d) len(%d)]\n",
+			       new_skb_size, compressor_skb_size,
+			       ppp->dev->hard_header_len, len);
+		}
+#endif
 		skb_put(skb, len);
 		skb_pull(skb, 2);	/* pull off A/C bytes */
 	} else if (len == 0) {
@@ -1179,6 +1188,9 @@ ppp_send_frame(struct ppp *ppp, struct sk_buff *skb)
 			/* didn't compress */
 			kfree_skb(new_skb);
 		} else {
+#if 1
+			unsigned int orig_skb_len = skb->len;
+#endif
 			if (cp[0] & SL_TYPE_COMPRESSED_TCP) {
 				proto = PPP_VJC_COMP;
 				cp[0] &= ~SL_TYPE_COMPRESSED_TCP;
@@ -1188,6 +1200,13 @@ ppp_send_frame(struct ppp *ppp, struct sk_buff *skb)
 			}
 			kfree_skb(skb);
 			skb = new_skb;
+#if 1
+			if (len + 2 > (skb->end - skb->tail)) {
+				printk(KERN_ERR "slhc_compress_skb: Compression overflow ["
+				       "skb->len(%u) hard_header_len(%d) len(%d)]\n",
+				       orig_skb_len, ppp->dev->hard_header_len, len);
+			}
+#endif
 			cp = skb_put(skb, len + 2);
 			cp[0] = 0;
 			cp[1] = proto;
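
The check that the debugging patch adds before each skb_put() can be modelled in user space. The following is only a sketch under stated assumptions: toy_skb and its helpers are hypothetical stand-ins that keep just the four buffer pointers, and toy_skb_put_checked() returns NULL where the real skb_put() would call skb_over_panic(). Note that the length compared against the tailroom should be the full length handed to the put, which in the ppp_send_frame() hunk above is len + 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for struct sk_buff: only the four buffer pointers. */
struct toy_skb {
	unsigned char *head;	/* start of the allocation */
	unsigned char *data;	/* start of the packet data */
	unsigned char *tail;	/* end of the packet data */
	unsigned char *end;	/* end of the allocation */
};

/* Allocate a buffer of `size` bytes with no data in it yet. */
static int toy_skb_init(struct toy_skb *skb, size_t size)
{
	assert(skb != NULL);
	skb->head = malloc(size);
	if (skb->head == NULL)
		return -1;
	skb->data = skb->head;
	skb->tail = skb->head;
	skb->end = skb->head + size;
	return 0;
}

/* Free bytes left between tail and end. */
static size_t toy_skb_tailroom(const struct toy_skb *skb)
{
	return (size_t)(skb->end - skb->tail);
}

/*
 * Extend the data area by `len` bytes and return the old tail.
 * Returns NULL and leaves the buffer untouched if `len` does not fit.
 */
static unsigned char *toy_skb_put_checked(struct toy_skb *skb, size_t len)
{
	unsigned char *old_tail = skb->tail;

	if (len > toy_skb_tailroom(skb))
		return NULL;
	skb->tail += len;
	return old_tail;
}

/* Release the buffer. */
static void toy_skb_free(struct toy_skb *skb)
{
	free(skb->head);
	skb->head = skb->data = skb->tail = skb->end = NULL;
}
```

With a 640-byte buffer, toy_skb_put_checked(&skb, 847) is rejected, whereas a 600-byte put succeeds and leaves 40 bytes of tailroom.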
Comment 3 Kris Karas 2011-01-31 16:58:31 UTC
bugzilla-daemon@bugzilla.kernel.org wrote:
> [...]
> And we can see from the skb_over_panic message that we have an SKB
> which was allocated with 640 bytes of space, but we are trying to
> "put" 847 bytes into it which is too large and overflows.
>
> Can you run with the following debugging patch and see what it prints
> out when this happens?
>
> diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c
> index 9f6d670..06c6ea7 100644

Applied, tried and triggered.
But alas, the OOPS messages were scrolling off the screen in an infinite 
loop, making it impossible to copy anything on-screen.  The klog 
timestamps were frozen, and the Magic SysRq key was utterly non-responsive.

Is there some patch I can apply to the kernel which will force it to 
"--more--" paginate any OOPS output?  As mentioned in my original 
submission, it has taken literally months of triggering before I "got 
lucky" and had just one OOPS that was on-screen and copyable.  (The 
recent oops-to-memory feature would be useless, as without Magic-SysRq 
working, my only way to make the machine responsive is power-cycling.)

I haven't yet found a good way to trigger this remotely on the "client", 
while I'm sitting at the "server" end of the PPP link; the machines are 
geographically distant and require some travel in order to trigger at 
one and reset/copy-oops at the other.
Comment 4 Kris Karas 2012-08-18 20:50:06 UTC
I just noticed Alan cc'ed himself to this.
This bug has been sitting stagnant for some time now.

Regrettably, changes in network infrastructure where the machine in question is located altered the timing such that the OOPS (even with the same kernel) was no longer triggerable.  I (the OP) haven't been able to reproduce this in nearly a year.

I suggest closing this as "can not reproduce" unless anybody else has been affected.
Comment 5 Alan 2012-08-18 21:04:40 UTC
I have a couple more examples, plus a trace and a root cause for it in the tty layer, so it's a live bug. Whether you saw that bug or a different one of the several fixed so far, I don't know.
Comment 6 Kris Karas 2012-08-18 21:08:28 UTC
OK, well then, I'll be happy to test any patches, though it may be limited to "doesn't break anything new".  :-)
Comment 7 Alan 2013-12-11 12:12:07 UTC
Closing as obsolete; a pile of relevant tty hangup changes have hopefully fixed this.