Bug 11386

Summary:	p54 always causes BUG under high traffic in interrupt handler
Product:	Drivers	Reporter:	Sean Young (sean)
Component:	network-wireless	Assignee:	Christian Lamparter (chunkeey)
Status:	RESOLVED OBSOLETE
Severity:	normal	CC:	alan, chunkeey, colinf, edpeur
Priority:	P1
Hardware:	All
OS:	Linux
Kernel Version:	2.6.34	Subsystem:
Regression:	No	Bisected commit-id:
Attachments:	p54-driver with new pci firmware; stop queue stop

Description Sean Young 2008-08-20 13:15:31 UTC

Latest working kernel version: never (although prism54 driver does work)
Earliest failing kernel version: 2.6.26 (have not tested earlier versions)
Distribution: Ubuntu (but mainline kernel)
Hardware Environment:

AMD Elan (http://www.embeddedarm.com/products/board-detail.php?product=TS-5500)

sean@tiger:~$ lspci
00:00.0 Host bridge: Advanced Micro Devices [AMD] ELanSC520 Microcontroller
00:0b.0 USB Controller: OPTi Inc. 82C861 (rev 10)
00:0c.0 CardBus bridge: Texas Instruments PCI1510 PC card Cardbus Controller
00:0d.0 Ethernet controller: Davicom Semiconductor, Inc. 21x4x DEC-Tulip compat)
01:00.0 Network controller: Intersil Corporation ISL3890 [Prism GT/Prism Duette)
sean@tiger:~$ lspci -n
00:00.0 0600: 1022:3000
00:0b.0 0c03: 1045:c861 (rev 10)
00:0c.0 0607: 104c:ac56
00:0d.0 0200: 1282:9102 (rev 40)
01:00.0 0280: 1260:3890 (rev 01)

Software Environment:

dmesg:
p54: LM86 firmware
p54: FW rev 2.7.0.0 - Softmac protocol 4.1
p54: unknown eeprom code : 0x1
p54: unknown eeprom code : 0x1007
p54: unknown eeprom code : 0x1008
p54: unknown eeprom code : 0x1100
p54: unknown eeprom code : 0x3
p54: unknown eeprom code : 0x1905
phy0: Selected rate control algorithm 'pid'
phy0: hwaddr 00:04:e2:aa:48:d2, isl3890
firmware: requesting isl3886

Problem Description:
Under high traffic (e.g. transferring a large file over ftp), a BUG occurs after about 10 to 50MB. I also tried 2.6.25 and 2.6.26, but those fail in the interrupt handler too.

CONFIG_PRISM54 does work however. 

Steps to reproduce:
Just done a git-pull.

BUG: unable to handle kernel NULL pointer dereference at 00000088
IP: [<c01ccf4c>] skb_put+0x4/0x2e
Oops: 0000 [#1]

Pid: 836, comm: ftp Not tainted (2.6.27-rc3-00632-g1bbe44f #1)
EIP: 0060:[<c01ccf4c>] EFLAGS: 00010202 CPU: 0
EIP is at skb_put+0x4/0x2e
EAX: 00000000 EBX: 00000000 ECX: 80000012 EDX: 00000622
ESI: 00000003 EDI: c1c0ace0 EBP: c1c9d044 ESP: c0883a68
 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
Process ftp (pid: 836, ti=c0882000 task=c1c3f440 task.ti=c0882000)
Stack: 00000000 00000003 c01aa419 c1c0a180 c1c9d000 c103e4c0 c1f26aa4 00000011
       00000004 00000002 c1c13c60 00000000 00000000 0000000a c01245d1 c0277aa4
       0000000a c0883adc 00000230 c0125569 00000000 0000000a c0104027 c0277aa4
Call Trace:
 [<c01aa419>] p54p_interrupt+0x129/0x1de
 [<c01245d1>] handle_IRQ_event+0x1a/0x3f
 [<c0125569>] handle_level_irq+0x7a/0x8d
 [<c0104027>] do_IRQ+0x4e/0x64
 [<c0102c03>] common_interrupt+0x23/0x30
 [<c0163c8e>] ext3_test_allocatable+0x20/0x2d
 [<c0163e88>] ext3_try_to_allocate+0x9d/0x227
 [<c016474c>] ext3_try_to_allocate_with_rsv+0x2c0/0x38b
 [<c01649e8>] ext3_new_blocks+0x1d1/0x4c3
 [<c015269b>] __bread+0x6/0x67
 [<c01672ba>] ext3_get_blocks_handle+0x34b/0x790
 [<c0129794>] __rmqueue+0x14/0x196
 [<c0167890>] ext3_get_block+0x83/0xb6
 [<c01518d5>] __block_prepare_write+0xe1/0x2a4
 [<c0151b2a>] block_write_begin+0x71/0xcc
 [<c016780d>] ext3_get_block+0x0/0xb6
 [<c0168a28>] ext3_write_begin+0xc1/0x161
 [<c016780d>] ext3_get_block+0x0/0xb6
 [<c012721b>] generic_file_buffered_write+0xec/0x55e
 [<c0126489>] file_remove_suid+0x18/0x47
 [<c0127c79>] __generic_file_aio_write_nolock+0x401/0x452
 [<c0102c03>] common_interrupt+0x23/0x30
 [<c01c85ca>] sock_aio_read+0xb8/0xc2
 [<c0127d03>] generic_file_aio_write+0x39/0x8d
 [<c01656f9>] ext3_file_write+0x19/0x84
 [<c0138c02>] do_sync_write+0xbd/0x104
 [<c012b2b1>] wb_timer_fn+0xc/0x27
 [<c0113f5c>] run_timer_softirq+0xf3/0x133
 [<c011a959>] autoremove_wake_function+0x0/0x2b
 [<c01aa496>] p54p_interrupt+0x1a6/0x1de
 [<c0138b45>] do_sync_write+0x0/0x104
 [<c01392dc>] vfs_write+0x7f/0xec
 [<c013965f>] sys_write+0x3c/0x63
 [<c01029e2>] syscall_call+0x7/0xb
 =======================
Code: 00 00 00 01 50 50 8b 80 94 00 00 00 3b 83 90 00 00 00 73 0b 8b 4c 24 04 8
EIP: [<c01ccf4c>] skb_put+0x4/0x2e SS:ESP 0068:c0883a68
Kernel panic - not syncing: Fatal exception in interrupt

Comment 1 John W. Linville 2008-08-22 10:48:56 UTC

Looks like p54pci.c around line 338:

                while (i != idx) {
                        u16 len;
                        struct sk_buff *skb;
                        desc = &ring_control->rx_data[i];
                        len = le16_to_cpu(desc->len);
                        skb = priv->rx_buf[i];

                        skb_put(skb, len);

So, probably a buffer starvation problem under high traffic.  I'll try to dig deeper later -- anyone else is welcome to analyze further in the meantime.
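
As a rough illustration of the starvation theory (not the actual driver code, and not the eventual fix): if a ring slot was never refilled, the pointer read from rx_buf[i] is NULL, and handing it to skb_put() would be expected to fault at a small offset, which is consistent with the oops above. The userspace C model below uses hypothetical names (fake_buf, fake_ring, rx_drain, RING_SIZE) and only shows the idea of checking each slot before use while still advancing the index past a skipped slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8u /* hypothetical ring size, not the driver's value */

/* Hypothetical stand-in for struct sk_buff: only tracks the used length. */
struct fake_buf {
    size_t len;
};

/* Hypothetical model of the rx ring: one buffer pointer per descriptor. */
struct fake_ring {
    struct fake_buf *slot[RING_SIZE]; /* NULL models a failed refill */
    uint16_t desc_len[RING_SIZE];     /* length reported by the device */
};

/*
 * Walk the ring from index i up to (not including) idx, both of which
 * must be below RING_SIZE. Empty slots are counted in *skipped instead
 * of being dereferenced. Returns the number of frames accepted.
 */
static unsigned int rx_drain(struct fake_ring *ring, unsigned int i,
                             unsigned int idx, unsigned int *skipped)
{
    unsigned int accepted = 0;

    assert(ring != NULL && skipped != NULL);
    *skipped = 0;

    while (i != idx) {
        struct fake_buf *buf = ring->slot[i];

        if (buf == NULL) {
            /* No buffer here: skip the frame rather than crash. */
            (*skipped)++;
        } else {
            /* Models skb_put(skb, len) on a valid buffer. */
            buf->len = ring->desc_len[i];
            accepted++;
        }

        /* Advance even when skipping, otherwise the loop never ends. */
        i = (i + 1) % RING_SIZE;
    }

    return accepted;
}
```

In this model a skipped slot simply means a dropped frame; whether that trade-off is acceptable in the real driver is a separate question.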

Comment 2 Christian Lamparter 2008-09-02 12:12:58 UTC

Hmm, this bug may already be fixed in wireless-next:
7262d59366f972b898ea134639112d34bcac35b3 ("p54pci: rx tasklet refactoring") 
(which is actually a resend of the old "[RFC][PATCH 2/4] p54: p54pci updates" from back in April)

BTW, if you switch to the wireless-next git, there is another pending patch 
"[PATCH] p54pci: increase ring buffer index counter when skipping" (see linux-wireless mailing-list archive for this one) which you probably want as well.

Regards,
   Chr.

Comment 3 Sean Young 2008-09-02 13:01:00 UTC

I found the patch from the mailing-list archive here:

http://thread.gmane.org/gmane.linux.kernel.wireless.general/19977

It seems to be incomplete; there is at least a closing brace missing. I've tested it by assuming all that is missing is a closing brace (on top of the first patch). 

Either way after a short period of data reception all wireless activity stops. There is no error in dmesg.

Comment 4 Christian Lamparter 2008-09-02 14:28:38 UTC

Hmm, http://marc.info/?l=linux-wireless&m=122021355927239 is complete. So, false alarm I guess...

Anyway, the "all wireless activity stops" symptom (with no oops or any hint of what could be wrong this time) could be anything from a firmware crash, to an unhandled oom/slow cpu case, to a problem with multiqueue (but since you hopefully pulled the latest wireless-next, this one is unlikely).

Well, a shot in the dark... what happens if you lower the MTU with ifconfig down to 1300 or less?

Regards,
 Chr

Comment 5 Sean Young 2008-09-02 15:42:25 UTC

I hadn't pulled wireless-next; I only applied the two patches mentioned under comment #2 to the latest Linus' git tree.

Now I've pulled wireless-next and applied the second patch. Same effect: wireless activity stops with no hint. Note that this CPU is slow but very little apart from ftp is running.

Lowering the MTU has no effect with either tree.

Comment 6 Christian Lamparter 2008-09-03 13:33:20 UTC

Created attachment 17600 [details]
p54-driver with new pci firmware

out-of-tree variant... (kernel build environment necessary)
just extract, put the firmware into the right place (/lib/firmware ?), run make, make unload, make load and test...

Comment 7 Christian Lamparter 2008-09-03 13:41:09 UTC

Alright... so let's see if a new firmware helps in your case.

I've already attached the latest driver code with a new firmware that is known to perform better without crashing on low resources.

(As said in comment #6, it has a makefile to build it out-of-tree, so you don't have to look for all the patches that are scattered around on the mailing list.)

Comment 8 Sean Young 2008-09-04 13:51:32 UTC

I built the attached driver in wireless-next and after 4MB wireless activity again stopped. It was not built as a module and WEP is enabled. Also the new firmware was in /lib/firmware/$(uname -r)/. I also tried with WEP disabled (just open) and the same thing happened.

I wouldn't mind digging into this myself, but I am a bit lost as to how this thing actually works. I couldn't find any documentation -- is there any available?

p54pci 0000:01:00.0: enabling device (0000 -> 0002)
p54pci 0000:01:00.0: setting latency timer to 64
firmware: requesting isl3886
p54: LM86 firmware
p54: FW rev 2.13.1.0 - Softmac protocol 5.5
p54: unknown eeprom code : 0x1
p54: unknown eeprom code : 0x1007
p54: unknown eeprom code : 0x1008
p54: unknown eeprom code : 0x1100
p54: unknown eeprom code : 0x3
p54: unknown eeprom code : 0x1905
phy0: hwaddr 00:04:e2:aa:48:d2, MAC:isl3890 RF:Frisbee
phy0: Selected rate control algorithm 'pid'
firmware: requesting isl3886
wlan0: authenticate with AP 00:14:7f:30:17:e9
wlan0: authenticated
wlan0: associate with AP 00:14:7f:30:17:e9
wlan0: RX AssocResp from 00:14:7f:30:17:e9 (capab=0x411 status=0 aid=1)
wlan0: associated

Thanks

Comment 9 Christian Lamparter 2008-09-04 16:17:23 UTC

Not much documentation; mostly the Windows driver and usb-snoop.
But you can find some information in the old "islsm" driver:
http://islsm.org/wiki/

But your dmesg looks a bit suspicious / truncated... 

Isn't the mac80211 stack trying to reconnect? Normally it should be full
of proberesp & authentication timeouts... does iwlist wlan0 scan
still show your AP?

Comment 10 Christian Lamparter 2008-09-04 16:21:04 UTC

Created attachment 17626 [details]
stop queue stop

Well, could be... the downside of this workaround is packet loss.

Comment 11 Sean Young 2008-09-05 08:56:22 UTC

I've waited for 5 minutes but mac80211-stack isn't trying to reconnect. iwconfig is still showing it is connected. iwlist wlan0 scan shows:

sean@tiger:~$ iwlist wlan0 scan
wlan0     Scan completed :
          Cell 01 - Address: 00:14:7F:30:17:E9
                    ESSID:"44 Millbrooke Court"
                    Mode:Master
                    Channel:3
                    Frequency:2.422 GHz (Channel 3)
                    Quality=64/100  Signal level:82/127  
                    Encryption key:on
                    IE: Unknown: 00133434204D696C6C62726F6F6B6520436F757274
                    IE: Unknown: 010882848B962430486C
                    IE: Unknown: 030103
                    IE: Unknown: 2A0100
                    IE: Unknown: 2F0100
                    IE: Unknown: 32040C121860
                    IE: Unknown: DD06001018020000
                    IE: Unknown: DD180050F2020101080003A4000027A4000042435E00620
                    Bit Rates:1 Mb/s; 2 Mb/s; 5.5 Mb/s; 11 Mb/s; 18 Mb/s
                              24 Mb/s; 36 Mb/s; 54 Mb/s; 6 Mb/s; 9 Mb/s
                              12 Mb/s; 48 Mb/s
                    Extra:tsf=0000000024fe5184
                    Extra: Last beacon: 100ms ago

Additionally, if I bring the interface down & up again, all is well again.

With the patch the connection stays alive for longer. I could download 56MiB.

Comment 12 Christian Lamparter 2008-09-05 10:04:50 UTC

hmm: iwlist looks a bit suspicious... (iwlist seems to be a bit outdated?)

and more worrying: the low TSF value tells me that your access point has an uptime of just about 5-8 minutes? (are these the same 5 minutes you have waited after the traffic died?)
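
For reference, the arithmetic behind the uptime estimate can be sketched as follows. The 802.11 TSF timer counts microseconds, so the value from the scan output in comment #11 (tsf=0000000024fe5184) converts directly to elapsed time. The helper name tsf_to_seconds is made up for this illustration, and reading the TSF as access point uptime is an assumption (the timer starts over whenever the AP restarts its BSS).

```c
#include <stdint.h>

/*
 * Convert an 802.11 TSF timer value, which counts microseconds,
 * to whole seconds (fractional part discarded).
 */
static uint64_t tsf_to_seconds(uint64_t tsf_us)
{
    return tsf_us / 1000000u;
}
```

For 0x24fe5184 this gives 620 seconds, i.e. roughly ten minutes, which is in the same ballpark as the estimate above.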

Comment 13 Sean Young 2008-09-07 08:41:02 UTC

I had just turned WEP on again on the access point. The access point is used a lot from a laptop and seems to work fine with that one, at least. Note also that with the original prism54 driver it also works fine with the same wireless card.

I'll try upgrading wireless-tools.

Comment 14 Christian Lamparter 2008-09-07 15:31:12 UTC

Well, do you have an extra wifi card that you can put into monitor mode to
capture the last packets of p54 before the link dies?

BTW: prism54 vs. p54 is a bit of an apples and oranges comparison.
p54 is a pure softmac driver that relies on mac80211 to scan, associate, encrypt & make 802.11 frames out of 802.3 (ethernet) frames, while prism54 just looks like a normal ethernet device to the kernel, with some wireless ioctls.

And you should really see at least missing probe responses & authentication timeouts in dmesg right after the link dies, but there aren't any...

Comment 15 Sean Young 2010-01-15 19:20:02 UTC

I've just tried 2.6.32.3 and I can no longer reproduce the problem. 

The leds on the PCMCIA card aren't blinking any more though.

Comment 16 Christian Lamparter 2010-01-15 20:05:51 UTC

Great... It's been a loooong long time.

About your LEDs: The LEDs are now controlled exclusively by software/mac80211-stack/user. CONFIG_P54_LEDS must be selected (this is done automatically
if CONFIG_MAC80211_LEDS & CONFIG_LEDS_CLASS are available to the driver) 
in order to get any sort of visual feedback.

Regards,
 Chr

Comment 17 Sean Young 2010-01-17 22:36:26 UTC

Changing the .config fixes the LEDs.

Unfortunately the original problem still exists. It happens with WEP, WPA and no encryption. Here is a tcpdump from another machine:

http://www.msxnet.org/tcpdump.1.gz

192.168.1.13 is the machine with the faulty card.
192.168.1.1 is the machine the HTTP download is served from. After ~5MB the wireless stops working.

I'm not sure how to interpret the dump.

Comment 18 Christian Lamparter 2010-01-17 23:14:46 UTC

Interesting, 
I assume that enabling the LEDs has caused the bug to reappear?

I just posted 3 patches for p54pci:

[1/3] http://patchwork.kernel.org/patch/73555/ (click on "Download")
[2/3] http://patchwork.kernel.org/patch/73556/
[3/3] http://patchwork.kernel.org/patch/73561/

Please let me know if they help, or if I have to dig deeper.

Regards,
     Chr

Comment 19 Sean Young 2010-01-18 21:14:28 UTC

Unfortunately having the LEDs config enabled or not makes no difference. I've tried 2.6.32.3 with and without LED enabled in .config, and after ~20G the connection just hangs.

I've also tried 2.6.33-rc4 with the three patches above. Still it hangs after about 20G (sometimes 13G, sometimes 35G). There is nothing logged in dmesg.

I've tried these firmware files:

root@tiger:/lib/firmware# md5sum isl3886pci  isl3886
ff7536af2092b1c4b21315bd103ef4c4  isl3886pci
8ff41cff31c9323330d6170b54735477  isl3886

So nothing I've tried made it work I'm afraid. Anything else I can try? 

Thanks!

Comment 20 edpeur 2010-01-30 10:40:55 UTC

Christian, thank you for these patches as they allow my computer to not crash.
Without these patches it was crashing within a few seconds of a high speed transfer.

Comment 21 Christian Lamparter 2010-01-30 13:35:31 UTC

I'm still looking into Sean's "connection hang" issue.

The testing system is:
IBM Tablet X41 (Pentium M throttled to ~200MHz)
PC Card is an old Netgear WG511 (Prism54 Full MAC).
Firmware MD5: ff7536af2092b1c4b21315bd103ef4c4 (2.13.12.0)

During testing, I've seen a number of acpi/powersave issues 
(PC stalls even if the card is not plugged in).

But I am unable to reproduce any "connection stall",
even after 250GiB (in both directions) and throughputs
as high as 30 Mbits/s the link is still up.

Is there a specific way to trigger the condition?

Do the card's LEDs still react to simple LED trigger events
(e.g: echo 0 or 1 > /sys/class/leds/p54*/brightness) afterwards
or is it necessary to do a
 - ifdown/ifup cycle
 - unplug, replug the card
 - system reboot

@edpeur: Do you need to apply all patches? Or is there one patch
which fixes the issue?

BTW:
http://patchwork.kernel.org/patch/74486/

Comment 22 edpeur 2010-02-06 08:15:12 UTC

p54pci-handle-dma-mapping-errors.patch crashes
p54pci-move-tx-cleanup-into-tasklet.patch works
p54pci-rx-frame-length-check.patch crashes

Comment 23 Sean Young 2010-02-08 21:49:03 UTC

(In reply to comment #21)
> I'm still looking into Sean's "connection hang" issue.
> 
> The testing system is:
> IBM Tablet X41 (Pentium M throttled to ~200MHz)
> PC Card is an old Netgear WG511 (Prism54 Full MAC).
> Firmware MD5: ff7536af2092b1c4b21315bd103ef4c4 (2.13.12.0)
> 
> During testing, I've seen a number of acpi/powersave issues 
> (PC stalls even if the card is not plugged in).
> 
> But I am unable to reproduce any "connection stall",
> even after 250GiB (in both directions) and throughputs
> as high as 30 Mbits/s the link is still up.
> 
> Is there a specific way to trigger the condition?

I'm running an apache on the local lan and I run:

wget -O /dev/null http://lan/linux-2.6.32.3.tar.bz2

and it reliably stalls after 19MiB.

> Do the card's LEDs still react to simple LED trigger events
> (e.g: echo 0 or 1 > /sys/class/leds/p54*/brightness) afterwards

No, it does not respond.

> or is it necessary to do a
>  - ifdown/ifup cycle

After ifdown/ifup it works again and the leds respond to changing
the values in sysfs.

>  - unplug, replug the card
>  - system reboot

Not needed.

> BTW:
> http://patchwork.kernel.org/patch/74486/

I've tried this one as well, no chance.

I don't use this computer very much any more, nor do I use the card very much. Although it would be nice to know bugs are fixed, this doesn't matter much any more.

Christian: if you wish I can send this computer + wireless card to you.

Comment 24 Christian Lamparter 2010-03-15 11:54:46 UTC

(In reply to comment #23)
> (In reply to comment #21)
> > I'm still looking into Sean's "connection hang" issue.
> > 
> > The testing system is:
> > IBM Tablet X41 (Pentium M throttled to ~200MHz)
> > PC Card is an old Netgear WG511 (Prism54 Full MAC).
> > Firmware MD5: ff7536af2092b1c4b21315bd103ef4c4 (2.13.12.0)
> > 
> > During testing, I've seen a number of acpi/powersave issues 
> > (PC stalls even if the card is not plugged in).
> > 
> > But I am unable to reproduce any "connection stall",
> > even after 250GiB (in both directions) and throughputs
> > as high as 30 Mbits/s the link is still up.
> > 
> > Is there a specific way to trigger the condition?
> 
> I'm running an apache on the local lan and I run:
> 
> wget -O /dev/null http://lan/linux-2.6.32.3.tar.bz2
> 
> and it reliably stalls after 19MiB.
Hmm, I only tried iperf.
(I'll update this once I've found some space for apache.)

> > Do the card's LEDs still react to simple LED trigger events
> > (e.g: echo 0 or 1 > /sys/class/leds/p54*/brightness) afterwards
> 
> No, it does not respond.
> 
> > or is it necessary to do a
> >  - ifdown/ifup cycle
> 
> After ifdown/ifup it works again and the leds respond to changing
> the values in sysfs.

There's a clue... Sounds a bit like stuck tx frames.
ar9170 has similar problems, so I'll see if I can
copy the routines from there.

> >  - unplug, replug the card
> >  - system reboot
> 
> Not needed.
> 
> > BTW:
> > http://patchwork.kernel.org/patch/74486/
> 
> I've tried this one as well, no chance.
> 
> I don't use this computer very much any more, nor do I use the card very
> much.
> Although it would be nice to know bugs are fixed, this doesn't matter much
> any
> more.
> 
> Christian: if you wish I can send this computer + wireless card to you.
chunkeey@googlemail.com, if you are still interested ;-)

Comment 25 Alan 2012-10-30 15:03:27 UTC

If this is still seen on modern kernels, then please re-open/update.