Bug 11386
| Summary: | p54 always causes BUG under high traffic in interrupt handler | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Sean Young (sean) |
| Component: | network-wireless | Assignee: | Christian Lamparter (chunkeey) |
| Status: | RESOLVED OBSOLETE | | |
| Severity: | normal | CC: | alan, chunkeey, colinf, edpeur |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.34 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | p54-driver with new pci firmware; stop queue stop | | |
Description
Sean Young
2008-08-20 13:15:31 UTC
Looks like p54pci.c around line 338:

    while (i != idx) {
        u16 len;
        struct sk_buff *skb;
        desc = &ring_control->rx_data[i];
        len = le16_to_cpu(desc->len);
        skb = priv->rx_buf[i];
        skb_put(skb, len);

So, probably a buffer starvation problem under high traffic. I'll try to dig deeper later -- anyone else is welcome to analyze further in the meantime.

Hmm, this bug could already be fixed in wireless-next: 7262d59366f972b898ea134639112d34bcac35b3 ("p54pci: rx tasklet refactoring"), which is actually a resend of the old "[RFC][PATCH 2/4] p54: p54pci updates" back in April.

BTW, if you switch to the wireless-next git, there is another pending patch, "[PATCH] p54pci: increase ring buffer index counter when skipping" (see the linux-wireless mailing-list archive for this one), which you probably want as well.

Regards,
Chr.

I found the patch from the mailing-list archive here:
http://thread.gmane.org/gmane.linux.kernel.wireless.general/19977

It seems to be incomplete; there is at least a closing brace missing. I've tested it by assuming all that is missing is a closing brace (on top of the first patch). Either way, after a short period of data reception all wireless activity stops. There is no error in dmesg.

hmm, http://marc.info/?l=linux-wireless&m=122021355927239 is complete. So, false alarm I guess...

Anyway, the "all wireless activity stops" (but no oops or any hint what could be wrong this time) could be anything from a firmware crash, an unhandled oom/slow cpu case, or a problem with multiqueue (but since you hopefully pulled the latest wireless-next this one is unlikely).

well so, a shot in the dark... what happens if you lower the MTU with ifconfig down to 1300 or less?

Regards,
Chr

I hadn't pulled wireless-next; I only applied the two patches mentioned under comment #2 to the latest Linus' git tree. Now I've pulled wireless-next and applied the second patch. Same effect: wireless activity stops with no hint. Note that this CPU is slow but very little apart from ftp is running.
Lowering the MTU has no effect with either tree.

Created attachment 17600 [details]
p54-driver with new pci firmware
out-of-tree variant... (kernel build environment necessary)
just extract, put the firmware into the right place (/lib/firmware ?), run make, make unload, make load and test...
Alright... so let's see if a new firmware helps in your case. I've already attached the latest driver code with a new firmware that is known to perform better without crashing on low resources. (As said in comment #6, it has a makefile to build it out-of-tree, so you don't have to look for all the patches that are scattered around on the mailing list.)

I built the attached driver in wireless-next and after 4MB wireless activity again stopped. It was not built as a module and WEP is enabled. Also the new firmware was in /lib/firmware/$(uname -r)/. I also tried with WEP disabled (just open) and the same thing happened.

I wouldn't mind digging into this myself, but am a bit lost as to how this thing actually works. I couldn't find any documentation -- is there any available?

p54pci 0000:01:00.0: enabling device (0000 -> 0002)
p54pci 0000:01:00.0: setting latency timer to 64
firmware: requesting isl3886
p54: LM86 firmware
p54: FW rev 2.13.1.0 - Softmac protocol 5.5
p54: unknown eeprom code : 0x1
p54: unknown eeprom code : 0x1007
p54: unknown eeprom code : 0x1008
p54: unknown eeprom code : 0x1100
p54: unknown eeprom code : 0x3
p54: unknown eeprom code : 0x1905
phy0: hwaddr 00:04:e2:aa:48:d2, MAC:isl3890 RF:Frisbee
phy0: Selected rate control algorithm 'pid'
firmware: requesting isl3886
wlan0: authenticate with AP 00:14:7f:30:17:e9
wlan0: authenticated
wlan0: associate with AP 00:14:7f:30:17:e9
wlan0: RX AssocResp from 00:14:7f:30:17:e9 (capab=0x411 status=0 aid=1)
wlan0: associated

Thanks

Not much documentation. mostly windows driver and usb-snoop. but you can find some information in the old "islsm" driver: http://islsm.org/wiki/

but your dmesg looks a bit suspicious / truncated... Isn't mac80211-stack trying to reconnect? normally it should be full of proberesp & authentication timeouts...

does iwlist wlan0 scan still show your AP?

Created attachment 17626 [details]
stop queue stop
well, could be... the downside of this workaround is packet loss
I've waited for 5 minutes but mac80211-stack isn't trying to reconnect. iwconfig is still showing it is connected. iwlist wlan0 scan shows:

sean@tiger:~$ iwlist wlan0 scan
wlan0     Scan completed :
          Cell 01 - Address: 00:14:7F:30:17:E9
                    ESSID:"44 Millbrooke Court"
                    Mode:Master
                    Channel:3
                    Frequency:2.422 GHz (Channel 3)
                    Quality=64/100  Signal level:82/127
                    Encryption key:on
                    IE: Unknown: 00133434204D696C6C62726F6F6B6520436F757274
                    IE: Unknown: 010882848B962430486C
                    IE: Unknown: 030103
                    IE: Unknown: 2A0100
                    IE: Unknown: 2F0100
                    IE: Unknown: 32040C121860
                    IE: Unknown: DD06001018020000
                    IE: Unknown: DD180050F2020101080003A4000027A4000042435E00620
                    Bit Rates:1 Mb/s; 2 Mb/s; 5.5 Mb/s; 11 Mb/s; 18 Mb/s
                              24 Mb/s; 36 Mb/s; 54 Mb/s; 6 Mb/s; 9 Mb/s
                              12 Mb/s; 48 Mb/s
                    Extra:tsf=0000000024fe5184
                    Extra: Last beacon: 100ms ago

Additionally, if I bring the interface down & up again, all is well again.

With the patch the connection stays alive for longer. I could download 56MiB.

hmm: iwlist looks a bit suspicious... (iwlist seems to be a bit outdated?) and more worrying: the low tsf value tells me that your accesspoint has an uptime of just about 5-8 minutes? (are these the same 5 minutes you have waited after the traffic died?)

I had just turned WEP on again on the accesspoint. The accesspoint is used lots from a laptop and seems to work fine on that one, at least. Note also that with the original prism54 it also works fine with the same wireless card.

I'll try upgrading wireless-tools.

Well, do you have an extra wifi card that you can put into monitor mode to capture the last packets of p54 before the link dies?

BTW: the prism54 vs. p54 is a bit of an apples and oranges comparison. p54 is a pure softmac driver that relies on mac80211 to scan, assoc, encrypt & make 802.11 frames out of 802.3 (ethernet), while prism54 just looks like a normal ethernet device to the kernel with some wireless ioctls.
and you should really see at least missing probe responses & authentication timeouts in the dmesg right after the link dies. but there aren't any...

I've just tried 2.6.32.3 and I can no longer reproduce the problem. The LEDs on the PCMCIA card aren't blinking any more though.

Great... It's been a loooong long time.

About your LEDs: The LEDs are now controlled by software/mac80211-stack/user exclusively. CONFIG_P54_LEDS must be selected (this is done automatically if CONFIG_MAC80211_LEDS & CONFIG_LEDS_CLASS are available to the driver) in order to get any sort of visual feedback.

Regards,
Chr

Changing the .config fixes the LEDs. Unfortunately the original problem still exists. It happens with WEP, WPA and no encryption.

Here is a tcpdump from another machine:
http://www.msxnet.org/tcpdump.1.gz

192.168.1.13 is the machine with the faulty card.
192.168.1.1 is the machine from which the download via http is done.

After ~5MB the wireless stops working. I'm not sure how to interpret the dump.

Interesting, I assume that enabling the LEDs has caused the bug to reappear?

I just posted 3 patches for p54pci:
[1/3] http://patchwork.kernel.org/patch/73555/ (click on "Download")
[2/3] http://patchwork.kernel.org/patch/73556/
[3/3] http://patchwork.kernel.org/patch/73561/

Please let me know if they help, or if I have to dig deeper.

Regards,
Chr

Unfortunately having the LEDs config enabled or not makes no difference. I've tried 2.6.32.3 with and without LED enabled in .config, and after ~20G the connection just hangs.

I've also tried 2.6.33-rc4 with the three patches above. Still it hangs after about 20G (sometimes 13G, sometimes 35G). There is nothing logged in dmesg.

I've tried these firmware files:

root@tiger:/lib/firmware# md5sum isl3886pci isl3886
ff7536af2092b1c4b21315bd103ef4c4  isl3886pci
8ff41cff31c9323330d6170b54735477  isl3886

So nothing I've tried made it work I'm afraid. Anything else I can try?

Thanks!

Christian, thank you for these patches as they allow my computer to not crash.
Without these patches it was crashing within a few seconds of a high speed transfer.

I'm still looking into Sean's "connection hang" issue.

The testing system is:
IBM Tablet X41 (Pentium M throttled to ~200MHz)
PC Card is an old Netgear WG511 (Prism54 Full MAC).
Firmware MD5: ff7536af2092b1c4b21315bd103ef4c4 (2.13.12.0)

During testing, I've seen a number of acpi/powersave issues (PC stalls even if the card is not plugged in).

But I am unable to reproduce any "connection stall", even after 250GiB (in both directions) and throughputs as high as 30 Mbits/s the link is still up.

Is there a specific way to trigger the condition?

Does the card LEDs still react to simple LED trigger events (e.g: echo 0 or 1 > /sys/class/leds/p54*/brightness) afterwards
or is it necessary to do a
- ifdown/ifup cycle
- unplug, replug the card
- system reboot

@edpeur: Do you need to apply all patches? Or is there one patch which fixes the issue?

BTW:
http://patchwork.kernel.org/patch/74486/

p54pci-handle-dma-mapping-errors.patch crashes
p54pci-move-tx-cleanup-into-tasklet.patch works
p54pci-rx-frame-length-check.patch crashes

(In reply to comment #21)
> I'm still looking into Sean's "connection hang" issue.
>
> The testing system is:
> IBM Tablet X41 (Pentium M throttled to ~200MHz)
> PC Card is an old Netgear WG511 (Prism54 Full MAC).
> Firmware MD5: ff7536af2092b1c4b21315bd103ef4c4 (2.13.12.0)
>
> During testing, I've seen a number of acpi/powersave issues
> (PC stalls even if the card is not plugged in).
>
> But I am unable to reproduce any "connection stall",
> even after 250GiB (in both directions) and throughputs
> as high as 30 Mbits/s the link is still up.
>
> Is there a specific way to trigger the condition?

I'm running an apache on the local lan and I run:

wget -O /dev/null http://lan/linux-2.6.32.3.tar.bz2

and it reliably stalls after 19MiB.

> Does the card LEDs still react to simple LED trigger events
> (e.g: echo 0 or 1 > /sys/class/leds/p54*/brightness) afterwards

No, it does not respond.

> or is it necessary to do a
> - ifdown/ifup cycle

After ifdown/ifup it works again and the LEDs respond to changing the values in sysfs.

> - unplug, replug the card
> - system reboot

Not needed.

> BTW:
> http://patchwork.kernel.org/patch/74486/

I've tried this one as well, no chance.

I don't use this computer very much any more, nor do I use the card very much. Although it would be nice to know bugs are fixed, this doesn't matter much any more.

Christian: if you wish I can send this computer + wireless card to you.

(In reply to comment #23)
> (In reply to comment #21)
> > I'm still looking into Sean's "connection hang" issue.
> >
> > The testing system is:
> > IBM Tablet X41 (Pentium M throttled to ~200MHz)
> > PC Card is an old Netgear WG511 (Prism54 Full MAC).
> > Firmware MD5: ff7536af2092b1c4b21315bd103ef4c4 (2.13.12.0)
> >
> > During testing, I've seen a number of acpi/powersave issues
> > (PC stalls even if the card is not plugged in).
> >
> > But I am unable to reproduce any "connection stall",
> > even after 250GiB (in both directions) and throughputs
> > as high as 30 Mbits/s the link is still up.
> >
> > Is there a specific way to trigger the condition?
>
> I'm running an apache on the local lan and I run:
>
> wget -O /dev/null http://lan/linux-2.6.32.3.tar.bz2
>
> and it reliably stalls after 19MiB.

hmm, I only tried iperf (I'll update this, once I found some space for apache)

> > Does the card LEDs still react to simple LED trigger events
> > (e.g: echo 0 or 1 > /sys/class/leds/p54*/brightness) afterwards
>
> No, it does not respond.
>
> > or is it necessary to do a
> > - ifdown/ifup cycle
>
> After ifdown/ifup it works again and the leds respond to changing
> the values in sysfs.

there's a clue... Sounds a bit like stuck tx frames.
Ar9170 has similar problems, so I'll see if I can copy the routines from there.

> > - unplug, replug the card
> > - system reboot
>
> Not needed.
>
> > BTW:
> > http://patchwork.kernel.org/patch/74486/
>
> I've tried this one as well, no chance.
>
> I don't use this computer very much any more, nor do I use the card very much.
> Although it would be nice to know bugs are fixed, this doesn't matter much any more.
>
> Christian: if you wish I can send this computer + wireless card to you.

chunkeey@googlemail.com, if you are still interested ;-)

If this is still seen on modern kernels then please re-open/update
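
For readers following the rx loop quoted at the top of this thread: it passes a device-reported length straight to skb_put(), which hits a BUG when the length exceeds the room left in the buffer. Below is only a userspace sketch of the idea suggested by two patch titles mentioned above ("rx frame length check" and "increase ring buffer index counter when skipping"). It is not the actual p54pci code or the content of those patches. The struct names, RX_RING_SIZE and RX_BUF_CAPACITY are hypothetical and chosen purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values for illustration; not taken from the p54pci driver. */
#define RX_RING_SIZE    8u
#define RX_BUF_CAPACITY 2400u

/* Hypothetical stand-in for one rx descriptor. */
struct rx_desc_sketch {
    uint16_t len;               /* frame length as reported by the device */
};

struct rx_stats_sketch {
    unsigned int delivered;     /* frames that would be handed up the stack */
    unsigned int dropped;       /* frames skipped because the length was bad */
};

/* Nonzero when a frame of len bytes is non-empty and fits the buffer. */
static int rx_len_ok(uint16_t len, size_t capacity)
{
    return len != 0 && (size_t)len <= capacity;
}

/*
 * Walk the ring from *index up to (not including) device_idx.
 * A frame with a bad length is dropped instead of being appended,
 * and the index advances either way so a skipped slot cannot stall
 * the loop.
 */
static void rx_ring_walk(const struct rx_desc_sketch *ring,
                         unsigned int *index, unsigned int device_idx,
                         struct rx_stats_sketch *stats)
{
    unsigned int i = *index % RX_RING_SIZE;

    assert(device_idx < RX_RING_SIZE);
    while (i != device_idx) {
        if (rx_len_ok(ring[i].len, RX_BUF_CAPACITY))
            stats->delivered++; /* a driver would call skb_put() here */
        else
            stats->dropped++;   /* skip the frame, still advance below */
        i = (i + 1) % RX_RING_SIZE;
    }
    *index = i;
}
```

To try it, fill an array of RX_RING_SIZE descriptors, call rx_ring_walk() with a start index and the device's current index, and inspect the counters. Whether this matches the cause of the hang discussed above is not established in this thread.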