Bug 15861

Summary: ath5k in monitor mode: Poison overwritten on buffers allocated from ath_rxbuf_alloc
Product: Drivers Reporter: Pedro Francisco (pedrogfrancisco)
Component: network-wirelessAssignee: drivers_network-wireless (drivers_network-wireless)
Status: CLOSED UNREPRODUCIBLE    
Severity: normal CC: br1, linville, me, mickflemm, pedrogfrancisco
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.34 Subsystem:
Regression: No Bisected commit-id:
Attachments: syslog of 2.6.34-daily-20100421
patch for rx desc size
syslog
objdump -S net/wireless/cfg80211.ko -- w/o debugging symbols
2.6.34 vanilla
dmesg of 2.6.34 w/ Bruno's patch applied
dmesg of 2.6.34 w/ Bruno's patch applied, 2nd warning on same boot
syslog of 2.6.34 w/ Bruno's patch applied, new boot
dmesg of 2.6.34+patch+wmb
better checks for rx dma
2.6.34+patch+wmb+patch2 - syslog
disable tasklets on reset
20100721-syslog.txt
20100721-wireshark.pcap
ath5k-2.6.36-rc4

Description Pedro Francisco 2010-04-27 10:59:58 UTC
How-to reproduce:
* Plug external wireless card with chipset handled by ath5k [ http://www.ubnt.com/src ]
* Boot Linux with slub_debug
* Remove all non-essential modules (all those possible at least)
* Start kismet
* Come back in the morning to check the logs.


syslog on attachment.

You'll notice at the end of the log a non directly related event. I'm assuming the card data is being written in various places it shouldn't causing major random screwups, so I've kept it for completeness sake.

P.S.: memtested for 7h, no errors.
Comment 1 Pedro Francisco 2010-04-27 11:01:47 UTC
Created attachment 26159 [details]
syslog of 2.6.34-daily-20100421
Comment 2 John W. Linville 2010-04-27 13:06:24 UTC
Poison overwritten on buffers allocated from ath_rxbuf_alloc...
Comment 3 John W. Linville 2010-05-13 15:19:48 UTC
ath5k guys, ping?
Comment 4 Bruno Randolf 2010-05-14 01:53:33 UTC
here is the mailing list thread with more infos:
http://marc.info/?l=linux-kernel&m=127200639509775&w=2
Comment 5 Bruno Randolf 2010-05-14 02:55:11 UTC
hey pedro, can you please tell us the exact chipset revisions as seen in dmesg and can you please load the module with debug output enabled ("modprobe ath5k debug=1") and then post the output of "dmesg|grep cachelsz". thanks
Comment 6 Bob Copeland 2010-05-14 03:11:02 UTC
On irc, nbd indicated we were missing a dma_sync somewhere when this first came up.  But we use the pci_xxx apis instead of the generic dma apis unlike ath9k.  Something to consider though.

PS, semester just ended so I can look at bugs again :)
Comment 7 Bruno Randolf 2010-05-14 03:58:41 UTC
there is a bug report for ath9k which looks *very* similar: http://marc.info/?l=linux-kernel&m=127379077422354&w=2
Comment 8 Bruno Randolf 2010-05-14 07:53:23 UTC
Created attachment 26375 [details]
patch for rx desc size

could you pleas try this patch and see if the problem persists?
Comment 9 Pedro Francisco 2010-05-14 10:18:52 UTC
I can't give you further info before Monday: the kernel I'm using hasn't ath5k debug support.

On Sunday night (GMT) I'll have the `dmesg| grep cachelsz' info and on Monday night I'll hopefully have tested the patch.
Comment 10 Pedro Francisco 2010-05-17 13:44:49 UTC
$ dmesg| grep cachelsz
ath5k phy2: (ath5k_rx_start:1681): cachelsz 64 rx_bufsize 2560

Will proceed testing tonight with 2.6.34 kernel + your patch.
Comment 11 Pedro Francisco 2010-05-18 07:34:48 UTC
Nope, still not working. I'm afraid the stacktrace stopped at the middle, but here it goes anyway.
Comment 12 Pedro Francisco 2010-05-18 07:35:24 UTC
Created attachment 26418 [details]
syslog
Comment 13 Bruno Randolf 2010-05-18 07:40:18 UTC
hmm, that looks like a different bug to me - no "Poison overwritten" message...

could you try the same kernel version you had the original bug with + only the patch? thanks!
Comment 14 Pedro Francisco 2010-05-18 07:49:06 UTC
I'd prefer removing the patch and see if the "Poison overwritten" reappears, if you don't mind, so as to not compile the whole kernel again?
Comment 15 Bob Copeland 2010-05-18 12:08:22 UTC
Just curious, could you also post output of "objdump -S net/wireless/cfg80211.ko" on this kernel with the null pointer dereference?  Stack trace seems to indicate slab_free is involved so it could still be slab corruption (was slub debug off this time?).  I would like to see what the registers correspond to.

By the way, ccache is a big help for making recompiles less painful.
Comment 16 Pedro Francisco 2010-05-18 14:42:49 UTC
I have been messing with GRUB options, I can't recall if I had `slub_debug' or `slub_debug=1' but I _believe_ I had one of them.

Regarding ccache I now use the inkernel `make all deb-pkg' instead of `make-kpkg' since the latter forces make clean, so ccache won't help much, but thanks for the pointer anyway :)
Comment 17 Pedro Francisco 2010-05-18 15:09:20 UTC
Created attachment 26423 [details]
objdump -S net/wireless/cfg80211.ko -- w/o debugging symbols
Comment 18 Bob Copeland 2010-05-18 21:01:09 UTC
The code appears to be in cmp_bss:

    595d:   39 c2                   cmp    %eax,%edx
    595f:   74 17                   je     5978 <cmp_bss+0x38>
    5961:   0f b7 40 04             movzwl 0x4(%eax),%eax
    5965:   0f b7 52 04             movzwl 0x4(%edx),%edx
    5969:   29 d0                   sub    %edx,%eax

with the IP at 0x5961.  I think that is this code:

241         if (a->channel != b->channel)
242                 return b->channel->center_freq - a->channel->center_freq;

where edx is b->channel and eax is a->channel.  eax looks corrupt or like a user space stack pointer if you're using a 1G/3G split.  I guess this is rb_find_bss being called from cfg80211_bss_update.

Let us know if you hit this again.  It could just be corruption of the cfg80211_internal_bss structure, possibly related to this bug, but it would be worrisome if there's some case where channel doesn't get initialized properly.
Comment 19 Pedro Francisco 2010-05-19 00:57:16 UTC
Created attachment 26425 [details]
2.6.34 vanilla
Comment 20 Pedro Francisco 2010-05-19 00:58:22 UTC
Created attachment 26426 [details]
dmesg of 2.6.34 w/ Bruno's patch applied
Comment 21 Pedro Francisco 2010-05-19 00:59:00 UTC
Created attachment 26427 [details]
dmesg of 2.6.34 w/ Bruno's patch applied, 2nd warning on same boot
Comment 22 Pedro Francisco 2010-05-19 00:59:32 UTC
Created attachment 26428 [details]
syslog of 2.6.34 w/ Bruno's patch applied, new boot
Comment 23 Pedro Francisco 2010-05-19 01:07:20 UTC
Since my neighbors decided to throw a party I took the liberty of not letting the computer running all night and saved on a per warning basis.

One dmesg without the patch, two with and a syslog with.

P.S.: I ran the files through iconv so Kate would let me cut extraneous text and only afterwards thought it had been a bad idea, so if the ASCII/UTF-8 part of the memory seems wrong it may be because of that.

Bruno, thanks for the IRC invite but at that hour it wasn't possible for me to stay at IRC.
Comment 24 Bruno Randolf 2010-05-19 01:10:17 UTC
ok, my patch doesn't fix the problem. i didn't really expect it to - just
hoping ;) but it was one thing i found while reviewing the code. it's better to
keep it in your kernel during further debugging, though, to avoid other kinds
of DMA problems, i think.
Comment 25 Bob Copeland 2010-05-19 12:19:43 UTC
This is kind of a long shot but we are definitely missing barriers around the buffer manipulations.  The device can see the new buffer as soon as we do ath5k_rxbuf_setup, but there's no guarantee that ds_link and ds_data will actually be changed in memory before sc->rxlink is updated (i.e. when it is added in the list).

Pedro, can you try adding a wmb(); right after the last line that Bruno added in his patch?
Comment 26 Pedro Francisco 2010-05-19 23:55:04 UTC
Created attachment 26447 [details]
dmesg of 2.6.34+patch+wmb

Nop, `wmb();' didn't work.
Comment 27 Bob Copeland 2010-05-20 03:58:43 UTC
Crud.  Maybe it'd be worth hacking up skb_release_data to store its stack trace near the end of the buffer and modify the poisoning check as well so we at least know the free path...
Comment 28 Pedro Francisco 2010-05-24 14:53:41 UTC
This gets particularly funny when I forget to remove my internal cards module, affects ieee80211_sta_work, iwl3945_irq_tasklet, ...

Is there anything else I can do? kmemcheck? Last time I tried my system hanged but I can try again...
Comment 29 Bruno Randolf 2010-05-26 03:02:43 UTC
Created attachment 26548 [details]
better checks for rx dma

hi! another patch to test. can you try this?
Comment 30 Bob Copeland 2010-05-26 12:47:29 UTC
I plan on also hacking up the skb_kfree tracepoint to record the call stack but I haven't had time to do so yet.  I guess I'll hold off until you can try Bruno's patch.
Comment 31 Pedro Francisco 2010-05-26 17:03:01 UTC
Ok ;)

Bruno, shall I apply it on top of the other patches (1 yours & another Copeland's) or just apply yours?
Comment 32 Bruno Randolf 2010-05-27 01:07:31 UTC
doesn't hurt if you keep the other patches.
Comment 33 Pedro Francisco 2010-05-27 18:18:46 UTC
Created attachment 26563 [details]
2.6.34+patch+wmb+patch2 - syslog

Didn't work, see attachment.

Just out of curiosity, could it be an hardware problem? I'm rather curious noone else reported it...
Comment 34 Bruno Randolf 2010-05-28 03:53:16 UTC
we have several other reports which point into the same direction. it could be hardware related, like affecting only specific chipsets, but i doubt it's a problem with your card. i think we have some bug in rx dma... did you see jumbo frames before the poison overwritten message? looks like that... every time?

have you told us which chipset you are using? can't find that info now. the output of lspci and dmesg|grep "ath5k.*chip", please... can you tell us the platform (laptop model?) too?
Comment 35 Pedro Francisco 2010-05-28 10:51:56 UTC
$ dmesg |grep "ath5.*chip"
ath5k phy1: Atheros AR5414 chip found (MAC: 0xa3, PHY: 0x61)

$ lspci |grep -i ath
04:00.0 Ethernet controller: Atheros Communications Inc. AR5001 Wireless Network Adapter (rev 01)

Card is a card bus wireless card, Ubiquiti SRX 300mW 802.11a/b/g MMCX
{ similar to the one in http://www.ubnt.com/src }

I'm assuming since it's a cardbus card the laptop model isn't needed, but here it goes: HP Pavilion dv6640ep .

Regarding the jumbo frames, they're fairly common but do not always appear. On two of the reports they don't appear at all, and on the others they vary from very frequent to occurring once.
Comment 36 Bruno Randolf 2010-06-10 02:17:16 UTC
Created attachment 26708 [details]
disable tasklets on reset

hi! one more patch to test, please... does this on help? thanks for your patience...
Comment 37 Pedro Francisco 2010-06-14 01:30:33 UTC
v2.6.35-rc3 seems to have filesystem/inode issues when I remove certain modules, rendering the system unusable, so I'm using v2.6.34 + 3ef2c0dd, so that your patch applies cleanly; also, I've dropped the other patches present in this bug report.

I'm currently compiling the kernel and will test tomorrow.
Comment 38 Pedro Francisco 2010-06-14 11:31:26 UTC
Ups compile failed. Will have to check on which patches 3ef2c0dd is dependent upon. Once I do that and test it I'll post again but at worst it will take more than a month.
Comment 39 Bruno Randolf 2010-06-15 00:52:04 UTC
hey! i don't know what you are referring to with 3ef2c0dd, i don't have such a commit in my tree. but it might be easier to stay with your old kernel version for this test. if you tell me which version that is, i'll create a patch for you. otherwise it's also easy to just manually add these lines at the beginning and end of ath5k_reset in drivers/net/wireless/ath/ath5k/base.c.

btw. this patch helpend in "my" bug with kismet:
https://bugzilla.kernel.org/show_bug.cgi?id=15998
Comment 40 Pedro Francisco 2010-07-21 10:49:48 UTC
Nope didn't work. I'll append the log later.

Also I've to check if Wireshark's assessment of wrong FCS from a certain frame till the end is related or not with this. It's possible it isn't since the pcap has further info on it which Wireshark may not support (pcap is actually PPI with GPS info which Wireshark doesn't seem to understand).


I needed to CHANGE the patch to comment out
tasklet_[en|dis]able(&sc->ani_tasklet);    /* ath5k_tasklet_ani */
since otherwise there would be a compile error.

Using 2.6.34.1 .
Comment 41 Pedro Francisco 2010-07-21 15:47:06 UTC
Ok, I've two groups of FCS errors:
Group A - from packet 67111 to the end (514910); I assume for now this is Wireshark misinterpreting GPS data as packet content, so I'll disregard it.

Group B - packets apparently related to Poison overwriting: all of their content is '6b 6b 6b 6b ...', before packet 67111 .

The time at which they happen, though, only matches for the first and second ones.
Wireshark's time of 1st '6b 6b 6b 6b ...' packet: ~284 sec
syslog's time of capture start: ~394 sec
syslog's time of 1st 'Poison overwritten': ~678 sec, hence ~284 sec since the capture start.

The 2nd '6b 6b 6b 6b ...' packet matches with the 3rd 'Poison overwritten' and the 3rd '6b 6b 6b 6b ...' packet has no match in syslog -- in the same timeframe, there are more 'Poison overwritten' than '6b 6b 6b 6b ...' packets.


I'm attaching only some of the first 'Poison overwritten' and the first 4 corrupt packets (the first one is correct - it's to maintain the timestamp since it is the first), maybe you can make some sense of it.
Comment 42 Pedro Francisco 2010-07-21 15:48:37 UTC
Created attachment 27185 [details]
20100721-syslog.txt
Comment 43 Pedro Francisco 2010-07-21 15:49:05 UTC
Created attachment 27186 [details]
20100721-wireshark.pcap
Comment 44 John W. Linville 2010-08-18 19:13:41 UTC
Bruno, ping?
Comment 45 Bob Copeland 2010-08-18 22:43:52 UTC
(In reply to comment #44)
> Bruno, ping?

I never got around to the skb_free patch, but we did
include the following commits in 2.6.36 to help this situation.

commit b3f194e54bdbaa4d508488cab24d23c376e235a2
Author: Bruno Randolf <br1@einfach.org>
Date:   Wed Jul 14 10:53:29 2010 +0900

    ath5k: clean up rxlink handling

commit 450464def78c94018d997ae6f823578499cdf879
Author: Bob Copeland <me@bobcopeland.com>
Date:   Tue Jul 13 11:32:41 2010 -0400

    ath5k: disable tasklets during reset

commit 5faaff747710dfb79d5aa72b9faface94ad4b3f3
Author: Bob Copeland <me@bobcopeland.com>
Date:   Tue Jul 13 11:32:40 2010 -0400

    ath5k: move reset to mac80211 workqueue

You tried the middle one but it would be a useful data point to know if the three in concert make any difference (might wait until 2.6.36-rc stabilizes a bit though).
Comment 46 Pedro Francisco 2010-09-15 12:27:14 UTC
Hi! I'm going to lose access to the hardware starting tomorrow so I'll test during the night with 2.6.36-rc4 and report back (I had been waiting for the stable release of 2.6.36).
Comment 47 John W. Linville 2010-09-15 18:18:39 UTC
Good luck with your testing!  Please report back here with the results...
Comment 48 Pedro Francisco 2010-09-16 09:18:08 UTC
Nope. Still has errors.

Well folks that's the end of it. Thank you for your time and patience.
Comment 49 Pedro Francisco 2010-09-16 09:19:01 UTC
Created attachment 30212 [details]
ath5k-2.6.36-rc4
Comment 50 Bruno Randolf 2010-09-16 09:22:30 UTC
thank you!!! and sorry we couldn't find a solution...