How-to reproduce: * Plug external wireless card with chipset handled by ath5k [ http://www.ubnt.com/src ] * Boot Linux with slub_debug * Remove all non-essential modules (all those possible at least) * Start kismet * Come back in the morning to check the logs. syslog on attachment. You'll notice at the end of the log a non directly related event. I'm assuming the card data is being written in various places it shouldn't causing major random screwups, so I've kept it for completeness sake. P.S.: memtested for 7h, no errors.
Created attachment 26159 [details] syslog of 2.6.34-daily-20100421
Poison overwritten on buffers allocated from ath_rxbuf_alloc...
ath5k guys, ping?
here is the mailing list thread with more infos: http://marc.info/?l=linux-kernel&m=127200639509775&w=2
hey pedro, can you please tell us the exact chipset revisions as seen in dmesg and can you please load the module with debug output enabled ("modprobe ath5k debug=1") and then post the output of "dmesg|grep cachelsz". thanks
On irc, nbd indicated we were missing a dma_sync somewhere when this first came up. But we use the pci_xxx apis instead of the generic dma apis unlike ath9k. Something to consider though. PS, semester just ended so I can look at bugs again :)
there is a bug report for ath9k which looks *very* similar: http://marc.info/?l=linux-kernel&m=127379077422354&w=2
Created attachment 26375 [details] patch for rx desc size could you pleas try this patch and see if the problem persists?
I can't give you further info before Monday: the kernel I'm using hasn't ath5k debug support. On Sunday night (GMT) I'll have the `dmesg| grep cachelsz' info and on Monday night I'll hopefully have tested the patch.
$ dmesg| grep cachelsz ath5k phy2: (ath5k_rx_start:1681): cachelsz 64 rx_bufsize 2560 Will proceed testing tonight with 2.6.34 kernel + your patch.
Nope, still not working. I'm afraid the stacktrace stopped at the middle, but here it goes anyway.
Created attachment 26418 [details] syslog
hmm, that looks like a different bug to me - no "Poison overwritten" message... could you try the same kernel version you had the original bug with + only the patch? thanks!
I'd prefer removing the patch and see if the "Poison overwritten" reappears, if you don't mind, so as to not compile the whole kernel again?
Just curious, could you also post output of "objdump -S net/wireless/cfg80211.ko" on this kernel with the null pointer dereference? Stack trace seems to indicate slab_free is involved so it could still be slab corruption (was slub debug off this time?). I would like to see what the registers correspond to. By the way, ccache is a big help for making recompiles less painful.
I have been messing with GRUB options, I can't recall if I had `slub_debug' or `slub_debug=1' but I _believe_ I had one of them. Regarding ccache I now use the inkernel `make all deb-pkg' instead of `make-kpkg' since the latter forces make clean, so ccache won't help much, but thanks for the pointer anyway :)
Created attachment 26423 [details] objdump -S net/wireless/cfg80211.ko -- w/o debugging symbols
The code appears to be in cmp_bss: 595d: 39 c2 cmp %eax,%edx 595f: 74 17 je 5978 <cmp_bss+0x38> 5961: 0f b7 40 04 movzwl 0x4(%eax),%eax 5965: 0f b7 52 04 movzwl 0x4(%edx),%edx 5969: 29 d0 sub %edx,%eax with the IP at 0x5961. I think that is this code: 241 if (a->channel != b->channel) 242 return b->channel->center_freq - a->channel->center_freq; where edx is b->channel and eax is a->channel. eax looks corrupt or like a user space stack pointer if you're using a 1G/3G split. I guess this is rb_find_bss being called from cfg80211_bss_update. Let us know if you hit this again. It could just be corruption of the cfg80211_internal_bss structure, possibly related to this bug, but it would be worrisome if there's some case where channel doesn't get initialized properly.
Created attachment 26425 [details] 2.6.34 vanilla
Created attachment 26426 [details] dmesg of 2.6.34 w/ Bruno's patch applied
Created attachment 26427 [details] dmesg of 2.6.34 w/ Bruno's patch applied, 2nd warning on same boot
Created attachment 26428 [details] syslog of 2.6.34 w/ Bruno's patch applied, new boot
Since my neighbors decided to throw a party I took the liberty of not letting the computer running all night and saved on a per warning basis. One dmesg without the patch, two with and a syslog with. P.S.: I ran the files through iconv so Kate would let me cut extraneous text and only afterwards thought it had been a bad idea, so if the ASCII/UTF-8 part of the memory seems wrong it may be because of that. Bruno, thanks for the IRC invite but at that hour it wasn't possible for me to stay at IRC.
ok, my patch doesn't fix the problem. i didn't really expect it to - just hoping ;) but it was one thing i found while reviewing the code. it's better to keep it in your kernel during further debugging, though, to avoid other kinds of DMA problems, i think.
This is kind of a long shot but we are definitely missing barriers around the buffer manipulations. The device can see the new buffer as soon as we do ath5k_rxbuf_setup, but there's no guarantee that ds_link and ds_data will actually be changed in memory before sc->rxlink is updated (i.e. when it is added in the list). Pedro, can you try adding a wmb(); right after the last line that Bruno added in his patch?
Created attachment 26447 [details] dmesg of 2.6.34+patch+wmb Nop, `wmb();' didn't work.
Crud. Maybe it'd be worth hacking up skb_release_data to store its stack trace near the end of the buffer and modify the poisoning check as well so we at least know the free path...
This gets particularly funny when I forget to remove my internal cards module, affects ieee80211_sta_work, iwl3945_irq_tasklet, ... Is there anything else I can do? kmemcheck? Last time I tried my system hanged but I can try again...
Created attachment 26548 [details] better checks for rx dma hi! another patch to test. can you try this?
I plan on also hacking up the skb_kfree tracepoint to record the call stack but I haven't had time to do so yet. I guess I'll hold off until you can try Bruno's patch.
Ok ;) Bruno, shall I apply it on top of the other patches (1 yours & another Copeland's) or just apply yours?
doesn't hurt if you keep the other patches.
Created attachment 26563 [details] 2.6.34+patch+wmb+patch2 - syslog Didn't work, see attachment. Just out of curiosity, could it be an hardware problem? I'm rather curious noone else reported it...
we have several other reports which point into the same direction. it could be hardware related, like affecting only specific chipsets, but i doubt it's a problem with your card. i think we have some bug in rx dma... did you see jumbo frames before the poison overwritten message? looks like that... every time? have you told us which chipset you are using? can't find that info now. the output of lspci and dmesg|grep "ath5k.*chip", please... can you tell us the platform (laptop model?) too?
$ dmesg |grep "ath5.*chip" ath5k phy1: Atheros AR5414 chip found (MAC: 0xa3, PHY: 0x61) $ lspci |grep -i ath 04:00.0 Ethernet controller: Atheros Communications Inc. AR5001 Wireless Network Adapter (rev 01) Card is a card bus wireless card, Ubiquiti SRX 300mW 802.11a/b/g MMCX { similar to the one in http://www.ubnt.com/src } I'm assuming since it's a cardbus card the laptop model isn't needed, but here it goes: HP Pavilion dv6640ep . Regarding the jumbo frames, they're fairly common but do not always appear. On two of the reports they don't appear at all, and on the others they vary from very frequent to occurring once.
Created attachment 26708 [details] disable tasklets on reset hi! one more patch to test, please... does this on help? thanks for your patience...
v2.6.35-rc3 seems to have filesystem/inode issues when I remove certain modules, rendering the system unusable, so I'm using v2.6.34 + 3ef2c0dd, so that your patch applies cleanly; also, I've dropped the other patches present in this bug report. I'm currently compiling the kernel and will test tomorrow.
Ups compile failed. Will have to check on which patches 3ef2c0dd is dependent upon. Once I do that and test it I'll post again but at worst it will take more than a month.
hey! i don't know what you are referring to with 3ef2c0dd, i don't have such a commit in my tree. but it might be easier to stay with your old kernel version for this test. if you tell me which version that is, i'll create a patch for you. otherwise it's also easy to just manually add these lines at the beginning and end of ath5k_reset in drivers/net/wireless/ath/ath5k/base.c. btw. this patch helpend in "my" bug with kismet: https://bugzilla.kernel.org/show_bug.cgi?id=15998
Nope didn't work. I'll append the log later. Also I've to check if Wireshark's assessment of wrong FCS from a certain frame till the end is related or not with this. It's possible it isn't since the pcap has further info on it which Wireshark may not support (pcap is actually PPI with GPS info which Wireshark doesn't seem to understand). I needed to CHANGE the patch to comment out tasklet_[en|dis]able(&sc->ani_tasklet); /* ath5k_tasklet_ani */ since otherwise there would be a compile error. Using 2.6.34.1 .
Ok, I've two groups of FCS errors: Group A - from packet 67111 to the end (514910); I assume for now this is Wireshark misinterpreting GPS data as packet content, so I'll disregard it. Group B - packets apparently related to Poison overwriting: all of their content is '6b 6b 6b 6b ...', before packet 67111 . The time at which they happen, though, only matches for the first and second ones. Wireshark's time of 1st '6b 6b 6b 6b ...' packet: ~284 sec syslog's time of capture start: ~394 sec syslog's time of 1st 'Poison overwritten': ~678 sec, hence ~284 sec since the capture start. The 2nd '6b 6b 6b 6b ...' packet matches with the 3rd 'Poison overwritten' and the 3rd '6b 6b 6b 6b ...' packet has no match in syslog -- in the same timeframe, there are more 'Poison overwritten' than '6b 6b 6b 6b ...' packets. I'm attaching only some of the first 'Poison overwritten' and the first 4 corrupt packets (the first one is correct - it's to maintain the timestamp since it is the first), maybe you can make some sense of it.
Created attachment 27185 [details] 20100721-syslog.txt
Created attachment 27186 [details] 20100721-wireshark.pcap
Bruno, ping?
(In reply to comment #44) > Bruno, ping? I never got around to the skb_free patch, but we did include the following commits in 2.6.36 to help this situation. commit b3f194e54bdbaa4d508488cab24d23c376e235a2 Author: Bruno Randolf <br1@einfach.org> Date: Wed Jul 14 10:53:29 2010 +0900 ath5k: clean up rxlink handling commit 450464def78c94018d997ae6f823578499cdf879 Author: Bob Copeland <me@bobcopeland.com> Date: Tue Jul 13 11:32:41 2010 -0400 ath5k: disable tasklets during reset commit 5faaff747710dfb79d5aa72b9faface94ad4b3f3 Author: Bob Copeland <me@bobcopeland.com> Date: Tue Jul 13 11:32:40 2010 -0400 ath5k: move reset to mac80211 workqueue You tried the middle one but it would be a useful data point to know if the three in concert make any difference (might wait until 2.6.36-rc stabilizes a bit though).
Hi! I'm going to lose access to the hardware starting tomorrow so I'll test during the night with 2.6.36-rc4 and report back (I had been waiting for the stable release of 2.6.36).
Good luck with your testing! Please report back here with the results...
Nope. Still has errors. Well folks that's the end of it. Thank you for your time and patience.
Created attachment 30212 [details] ath5k-2.6.36-rc4
thank you!!! and sorry we couldn't find a solution...