Bug 4836

Summary: enabling CONFIG_LLC=y crashes on llc_station_ac_send_test_r
Product: Networking Reporter: Michael Goldman (haizaar)
Component: Other Assignee: Arnaldo Carvalho de Melo (acme)
Status: DEFERRED WILL_FIX_LATER    
Severity: blocking CC: akpm
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.12 2.6.11.11 Subsystem:
Regression: --- Bisected commit-id:
Attachments: crash printout
.config file that was used to compile the kernel
linkloop package tarball, version 0.0.1
patch for linkloop-0.0.1
.config file for kernel 2.6.7
bumps kernel clock to 2khz
Adjusts polling time of rocket port driver
Makes rocket port driver 'export' pci-ids
Screenshot of the vanilla 2.6.12 crash on HP XW6200

Description Michael Goldman 2005-07-03 15:40:08 UTC
Distribution: LFS (Linux From Scratch) based
Hardware Environment: HP Proliant DL580, HP XW6200
Software Environment: 2.6.11.12-vanilla, 2.6.12-vanilla+genpatches-2.6.12-r5
Problem Description: Enabling CONFIG_LLC=y causes the kernel to crash in
llc_station_ac_send_test_r.
On the other hand, when I compiled 2.6.12 with CONFIG_LLC=n, everything
worked fine: the kernel did not crash, whether or not the linkloop_reply daemon
was running.

The details:
I'm using a custom distribution based on LFS 6.0 (2.6.7, gcc-3.4.1,
glibc-2.3.4-cvs). I've upgraded to a newer kernel and enabled CONFIG_LLC=y (and
CONFIG_LLC2=y). All of my machines (both Proliant and XW) have Broadcom NICs, and
I use the tg3 driver for them.
If one uses the linkloop package (from SourceForge) to 'ping' the upgraded
machine by MAC address, the kernel crashes.
In the beginning I thought it was a problem with the tg3 driver, which was the
reason I tried 2.6.12: there were many changes to the tg3 driver (according to
the ChangeLog). Applying the latest Gentoo patchset to 2.6.12 did not help either.

Steps to reproduce:
1. Compile LFS 6.0
2. Upgrade to 2.6.11.11 (or 12). 
3. Download the linkloop package from SourceForge.
4. Ping the machine from the network by MAC address using the linkloop tool.
5. See the immediate crash.
Comment 1 Michael Goldman 2005-07-03 15:43:48 UTC
Created attachment 5253 [details]
crash printout

This is the output printed on the console when the kernel crashes. Since the
system was already dead, I had to write it down manually.
Comment 2 Michael Goldman 2005-07-03 15:45:55 UTC
Created attachment 5254 [details]
.config file that was used to compile the kernel

This is the .config for 2.6.12 + genpatches-2.6.12-5.base +
genpatches-2.6.12-5.extras (except for speakup).
Comment 3 Michael Goldman 2005-07-03 15:50:55 UTC
Created attachment 5255 [details]
linkloop package tarball, version 0.0.1

This is the linkloop package, which enables 'pinging' computers by MAC address
if they run the linkloop_reply daemon.
NOTE: With CONFIG_LLC=y, the kernel crashed even when the linkloop_reply daemon
was not running.
Comment 4 Michael Goldman 2005-07-03 15:54:04 UTC
Created attachment 5256 [details]
patch for linkloop-0.0.1

The patch fixes some small bugs, performs more precise checks, enables
broadcast and uses the new PF_PACKET interface. This patch was applied on the
tested platform.
Comment 5 Michael Goldman 2005-07-03 16:09:04 UTC
Another note: the crash 'screenshot' was made on a Proliant DL580 machine, but
the crash output on the XW6200 machines was very similar. I can provide any
additional information (lspci, etc.) on request.
Comment 6 Andrew Morton 2005-07-03 16:36:39 UTC
It's not clear - could you please describe which patches, if any, were applied
to the vanilla 2.6.12 kernel?

Are you able to identify earlier 2.6-based kernels which worked OK?
If so, which?  2.6.7, I assume.  Have you tried anything later?

Comment 7 Andrew Morton 2005-07-03 16:38:13 UTC
Also, there should have been some extra debug output from this:

        printk(KERN_EMERG "skb_over_panic: text:%p len:%d put:%d head:%p "
                          "data:%p tail:%p end:%p dev:%s\n",
               here, skb->len, sz, skb->head, skb->data, skb->tail, skb->end,
               skb->dev ? skb->dev->name : "<NULL>");

Can you obtain that?
Comment 8 Michael Goldman 2005-07-04 04:49:00 UTC
Ok. Here is more info:
I've checked the situation with 2.6.7 kernel:
1. With CONFIG_LLC=n no problem occurs 
2. I've set CONFIG_LLC=m and CONFIG_LLC2=m. Recompiled, and now I have two
modules: llc and llc2
3. After rebooting the system (2.6.7), the modules are NOT loaded by default.
4. 
   a. linkloop <MAC> - all ok
   b. modprobe llc - dmesg says nothing new
   c. linkloop <MAC> - still all ok
   d. modprobe llc2 - dmesg says: 'NET: Registered protocol family 26'
   e. linkloop <MAC> - immediate crash

Note: on 2.6.11/12 llc was compiled into the kernel, not as modules.

I have to mention that this 2.6.7-based system has been working in 'production'
for about half a year and has proved to be stable. I discovered the problem when
I tried to upgrade to a newer kernel and noticed the LLC option in menuconfig.
Basically I do not use it, but as we see now, 2.6.7 has this bug as well.
Comment 9 Michael Goldman 2005-07-04 04:51:26 UTC
Created attachment 5258 [details]
.config file for kernel 2.6.7

Note that the 2.6.11/12 kernels were highly modularized, as opposed to 2.6.7.
Comment 10 Michael Goldman 2005-07-04 04:53:29 UTC
Created attachment 5259 [details]
bumps kernel clock to 2khz

Note that this patch was applied to the 2.6.7 kernel only, and not to 2.6.11/12.
Comment 11 Michael Goldman 2005-07-04 04:55:15 UTC
Created attachment 5260 [details]
Adjusts polling time of rocket port driver

This patch adjusts the polling time for the Comtrol Rocket Port driver. It was
applied to all kernels.
Comment 12 Michael Goldman 2005-07-04 04:56:40 UTC
Created attachment 5261 [details]
Makes rocket port driver 'export' pci-ids

This patch makes the driver export the pci-ids of the devices it supports. This
helps to autodetect Rocket Port cards. This patch was applied to all kernels.
Comment 13 Michael Goldman 2005-07-04 05:00:35 UTC
In addition to the attached patches, 2.6.12 (and only it) was patched with the
latest Gentoo patchset (genpatches-2.6.12-5 -
http://dev.gentoo.org/~dsd/genpatches/patches-2.6.12-5.htm)
Comment 14 Michael Goldman 2005-07-04 05:01:49 UTC
Andrew, can you please explain what exactly you want me to do with this?
        printk(KERN_EMERG "skb_over_panic: text:%p len:%d put:%d head:%p "
                          "data:%p tail:%p end:%p dev:%s\n",
               here, skb->len, sz, skb->head, skb->data, skb->tail, skb->end,
               skb->dev ? skb->dev->name : "<NULL>");

Comment 15 Andrew Morton 2005-07-04 15:44:28 UTC
You'll have trouble getting any of the net guys to attend to
this with so many kernel patches applied.  Please reproduce the
bug on a vanilla kernel.org kernel.

wrt this:

      printk(KERN_EMERG "skb_over_panic: ...

the output from that printk should have appeared on your terminal
or in the logs.  It might help the net guys fix the bug.
Comment 16 Michael Goldman 2005-07-05 04:26:39 UTC
Ok. I've reproduced the bug on VANILLA kernels as well - both 2.6.7 and 2.6.12.
The crash occurs under the same conditions in exactly the same way.

Andrew, here is the printk output you asked for:
skb_over_panic: text:f93e81ae len:1500 put:1497 head:f517de00 data:f517de2f
tail:f517e40b end:f517de80 dev:eth0
Comment 17 Michael Goldman 2005-07-05 04:29:24 UTC
Created attachment 5268 [details]
Screenshot of the vanilla 2.6.12 crash on HP XW6200
Comment 18 Patrick McHardy 2005-07-20 09:56:34 UTC
skb_over_panic: text:f93e81ae len:1500 put:1497 head:f517de00 data:f517de2f
tail:f517e40b end:f517de80 dev:eth0

I'm not familiar with the code or the protocol, but it seems the incoming frame
had a size of 1500 bytes and the code tries to copy it into a newly allocated
packet (llc_station_ac_send_test_r -> llc_pdu_init_as_test_rsp), but the newly
allocated packet only has 128-50 bytes available (llc_alloc_frame uses a fixed
size).

So the fix would be to either allocate a larger frame or copy less data. Arnaldo?
Comment 19 Arnaldo Carvalho de Melo 2005-07-28 14:32:17 UTC
Sorry guys, just not enough bandwidth right now, I'm working on getting a  
better infrastructure in place and then I'll go over LLC making it use this  
infrastructure, heck, I have lots of patches at:  
  
http://www.kernel.org/pub/linux/kernel/people/acme/v2.6/llc-2.6.1/  
  
I really plan to go over those and merge with new stuff I'm working on, but
for now using the PF_LLC BSD sockets interface is very experimental; it needs
a thorough audit.
 
The minimal code needed for Appletalk, IPX, token ring, etc should be safe 
tho.