Bug 34992 (athk5)
|Summary:||Regression with ath5k, cannot find any wireless network|
|Product:||Drivers||Reporter:||Joshua Covington (joshuacov)|
|Severity:||high||CC:||dcbw, florian, linville, maciej.rutecki, me, mickflemm, pedretti.fabio, rjw|
|Bug Depends on:|
dmesg with DEBUG=Y
/var/log/messages with DEBUG=Y
/sys/kernel/debug/ieee80211 directory with debug=0xffffffff
kernel-panic after nm-applet starts
another kernel panic
another foto of the kernel-panic (seems to show more information than the previous fotos)
dmesg with slub_debug=FPZ
screen shot of the kernel oops with slub_debug=FPZ
2.6.39 + compat-wireless-2011-05-16 + Nick's patch
dmesg showing the network loss at about 266s
Description Joshua Covington 2011-05-12 11:24:29 UTC
Created attachment 57542 [details] dmesg with DEBUG=Y

Description:
My ath5k-based card was working perfectly until I upgraded to kernel-184.108.40.206-24.fc15.x86_64. Now it cannot see any wireless network: `iwlist wlan0 scan` returns "no wireless found". I tested this with the latest compat-wireless-2011-05-11 and DEBUG=Y without luck. The log/debug files are attached.

Device:
08:04.0 Ethernet controller: Atheros Communications Inc. AR2413 802.11bg NIC (rev 01)
  Subsystem: AMBIT Microsystem Corp. Device 0418
  Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
  Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
  Latency: 168 (2500ns min, 7000ns max), Cache Line Size: 32 bytes
  Interrupt: pin A routed to IRQ 21
  Region 0: Memory at c0200000 (32-bit, non-prefetchable) [size=64K]
  Capabilities:  Power Management version 2
    Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
    Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=2 PME-
  Kernel driver in use: ath5k
  Kernel modules: ath5k

Latest working kernel: kernel-2.6.37-2.fc15
First non-working kernel: kernel-2.6.38-0.rc2.git0.1.fc15
Comment 1 Joshua Covington 2011-05-12 11:26:02 UTC
Created attachment 57552 [details] /var/log/messages with DEBUG=Y I filtered the file for NetworkManager.
Comment 2 Joshua Covington 2011-05-12 11:28:39 UTC
Created attachment 57562 [details] /sys/kernel/debug/ieee80211 directory with debug=0xffffffff This is the /sys/kernel/debug/ieee80211 directory. I compiled the ath5k driver from compat-wireless-2011-05-11 with DEBUG=Y and this is the corresponding directory.
Comment 3 Nick Kossifidis 2011-05-14 18:46:18 UTC
Can you try out this patch ? http://www.kernel.org/pub/linux/kernel/people/mickflemm/01-fast-chan-switch-modparm
Comment 4 Joshua Covington 2011-05-15 00:10:01 UTC
Created attachment 57872 [details] kernel-panic after nm-applet starts The patch seems to work, because I can scan for networks. As you can see from the attached photo, I get a kernel panic the moment nm-applet starts and tries to pass the password for a secured network. I'm not sure whether this is now a problem in the driver itself or something connected to NetworkManager. Everything seems to be fine with the NetworkManager service, but as I said, when nm-applet tries to pass a password, the kernel locks up. What do you think: is this a bug in the driver or in nm-applet (NetworkManager)?
Comment 5 Joshua Covington 2011-05-15 00:24:23 UTC
Was the algorithm for processing 63-ASCII passwords changed between 2.6.37 and 2.6.38? This is the only logical explanation for me because a single program with user rights cannot cause a kernel-panic, can it?
Comment 6 Joshua Covington 2011-05-15 08:33:36 UTC
Created attachment 57912 [details] another kernel panic I updated to the latest wpa_supplicant-0.7.3-8.fc15 and now I have another kernel panic. I'm quite puzzled at the moment as to whether this is a driver bug or not.
Comment 7 Joshua Covington 2011-05-15 08:34:51 UTC
In both cases the kernel panics when the nm-applet tries to make the connection.
Comment 8 Joshua Covington 2011-05-19 21:00:00 UTC
Created attachment 58662 [details] another foto of the kernel-panic (seems to show more information than the previous fotos) This is another photo of the kernel panic that shows a little more information than the other ones. Does anyone have any idea what's causing this lockup?
Comment 9 Bob Copeland 2011-05-20 02:21:32 UTC
Well, I'm not sure what to make of the photos. rt_cache_flush (this is networking code nominally outside of the driver) seems to be implicated in 2 of them, while the other looks like it's just randomly in default_idle (but I can't tell what the actual error is in the latter). Can you turn on slab/slub debugging and lockdep if they aren't already on?
Comment 10 Joshua Covington 2011-05-20 04:05:05 UTC
I looked into the config file and found the following:

CONFIG_LOCKDEP_SUPPORT=y
# CONFIG_SLAB is not set
CONFIG_SLABINFO=y
CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set

So some of them have already been set. Should I still set something (like debug level etc...), and if I have to recompile the kernel, are there any other options that need to be set? I don't like recompiling the kernel because it takes approx. 55 minutes, so I'd like to set everything that could help in one go (and save some recompiling). Are there any shortcuts for building just this portion of the kernel instead of recompiling everything? Maybe just the wireless stuff?
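As a side note, the same check can be scripted rather than read by eye. The following is a minimal Python sketch, not a tool used by anyone in this thread: the helper names `parse_kconfig` and `slub_debug_needs_boot_param` are made up for illustration, and the sample text is simply the option list quoted above. It relies on `CONFIG_SLUB_DEBUG=y` meaning that SLUB debugging support is compiled in, while `CONFIG_SLUB_DEBUG_ON` controls whether it is active by default.

```python
import re


def parse_kconfig(text):
    """Return {option: value} for kernel .config text; unset options map to None."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        # Lines such as "# CONFIG_SLAB is not set" mark an explicitly disabled option.
        m = re.match(r"# (CONFIG_\w+) is not set$", line)
        if m:
            opts[m.group(1)] = None
            continue
        m = re.match(r"(CONFIG_\w+)=(.*)$", line)
        if m:
            opts[m.group(1)] = m.group(2)
    return opts


def slub_debug_needs_boot_param(opts):
    """True when SLUB debugging is compiled in but not enabled by default."""
    return (opts.get("CONFIG_SLUB_DEBUG") == "y"
            and opts.get("CONFIG_SLUB_DEBUG_ON") != "y")


# The option list quoted in the comment above.
SAMPLE = """\
CONFIG_LOCKDEP_SUPPORT=y
# CONFIG_SLAB is not set
CONFIG_SLABINFO=y
CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
"""

if __name__ == "__main__":
    opts = parse_kconfig(SAMPLE)
    print("needs slub_debug= boot parameter:", slub_debug_needs_boot_param(opts))
```

With this configuration the debugging code is already built in, so it can be switched on at boot time without rebuilding the kernel.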
Comment 11 Bob Copeland 2011-05-20 11:23:28 UTC
(In reply to comment #10)
> So some of them have been already set. Should I still set something (like
> debug level etc...)

Yeah - please boot with kernel param "slub_debug=FPZ"
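For reference, in the SLUB documentation the letters stand for sanity checks (F), poisoning (P) and red zoning (Z). Whether the parameter actually took effect after a reboot can be checked against the kernel command line. The snippet below is a small Python sketch under stated assumptions: `slub_debug_flags` is a hypothetical helper, and the command line in the example is invented rather than taken from the reporter's machine.

```python
def slub_debug_flags(cmdline):
    """Return the set of flag letters given via slub_debug=..., or None if absent.

    A bare "slub_debug" (no "=") yields an empty set, meaning the defaults apply.
    """
    for token in cmdline.split():
        if token == "slub_debug":
            return set()
        if token.startswith("slub_debug="):
            value = token.split("=", 1)[1]
            # An optional ",<slab name>" suffix restricts debugging to some caches.
            flags = value.split(",", 1)[0]
            return set(flags)
    return None


if __name__ == "__main__":
    # Hypothetical command line; on a live system read /proc/cmdline instead.
    example = "ro root=/dev/sda1 slub_debug=FPZ quiet"
    print(sorted(slub_debug_flags(example)))
```

On a running system the input would come from `open("/proc/cmdline").read()`.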
Comment 12 Joshua Covington 2011-05-20 14:52:45 UTC
Created attachment 58772 [details] dmesg with slub_debug=FPZ There's nothing unusual here.
Comment 13 Joshua Covington 2011-05-20 14:54:22 UTC
Created attachment 58792 [details] screen shot of the kernel oops with slub_debug=FPZ I hope this one can help.
Comment 14 Bob Copeland 2011-05-20 16:27:48 UTC
Ok that does help, yes... So it looks like we were in the middle of adding the debugfs file for the STA, then you got a timer interrupt, it ran softirqs, which caused rcu_process_callbacks to run... somewhere in there IP got set to 0x10 so there is maybe a dangling pointer, or a race with the debugfs code. Not sure yet what might be in the callback.
Comment 15 Bob Copeland 2011-05-20 16:37:24 UTC
Did you apply Nick's patch to compat-wireless or did you rebuild the whole kernel? There seem to be some issues with the backports of some rcu code in compat-wireless where the structures have different layouts. If that's the case, can you try testing vanilla kernel.org kernel + Nick's patch?
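To make the layout concern concrete, here is a purely illustrative Python `ctypes` sketch. The two structures are hypothetical and are not the real kernel or compat-wireless definitions; the point is only that code built against one layout, reading memory written with another, picks up the wrong field as its callback pointer. The value 0x10 is chosen to echo the symptom in the oops and is not a claim about the actual cause.

```python
import ctypes


class HeadAsBuilt(ctypes.Structure):
    """Hypothetical layout the out-of-tree module was compiled against."""
    _fields_ = [("next", ctypes.c_uint64),
                ("func", ctypes.c_uint64)]


class HeadAsRunning(ctypes.Structure):
    """Hypothetical layout used by the running kernel (one extra leading field)."""
    _fields_ = [("flags", ctypes.c_uint64),
                ("next", ctypes.c_uint64),
                ("func", ctypes.c_uint64)]


def misread_callback(flags, next_ptr, func_ptr):
    """Write with the running layout, then read 'func' through the stale layout."""
    real = HeadAsRunning(flags, next_ptr, func_ptr)
    raw = bytes(real)  # raw memory image of the structure
    stale = HeadAsBuilt.from_buffer_copy(raw[:ctypes.sizeof(HeadAsBuilt)])
    return stale.func


if __name__ == "__main__":
    # The stale view returns the 'next' value (0x10) instead of the real callback.
    print(hex(misread_callback(0x1, 0x10, 0xFFFFFFFF81000000)))
```

A mismatch of this kind cannot occur within a single kernel build, which is why testing a vanilla kernel with the patch applied separates a driver bug from a backport problem.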
Comment 16 Joshua Covington 2011-05-20 17:44:08 UTC
All the output is from kernel-220.127.116.11-24.fc15.x86_64 + compat-wireless-2011-05-11 + Nick's patch. It's a lot easier to use this than recompiling everything. I'll try to use some of the stable releases available here: http://wireless.kernel.org/en/users/Download/stable/ because it takes such a long time to rebuild the whole kernel in my case. I'll let you know what works. Thanks for the fast response!
Comment 17 Joshua Covington 2011-05-20 18:16:46 UTC
Created attachment 58812 [details] dmesg debug=0xffffffff This is the output from 18.104.22.168-24.fc15.x86_64 + compat-wireless-2.6.39-rc6-1-sp.tar.bz2 + Nick's patch, built with ATH5K_DEBUG=Y. Everything works again. It looks like the issues with the backports of some rcu code in compat-wireless that you mentioned have either been fixed in 2.6.39 or have not yet been applied to the stable branch. So what should we do: 1. investigate further whether those patches are going to be added to the stable release (because they'll cause the lockup again), or 2. close this bug IF those backports have already been fixed?
Comment 18 Joshua Covington 2011-05-21 07:51:13 UTC
Created attachment 58852 [details] 2.6.39 + compat-wireless-2011-05-16 + Nick's patch I decided to test this further and used kernel-2.6.39-0.fc16.x86_64 + compat-wireless-2011-05-16 + Nick's patch. Evidently these rcu changes still exist in the upstream compat-wireless. The backtrace now shows some signs of the faulty rcu structure. I hope this is fixed soon, or at least before it lands in the mainline kernel.
Comment 19 Bob Copeland 2011-05-22 21:55:19 UTC
I don't think it's a mainline problem since structure layouts and APIs by definition are the same for any single version of the kernel. That is, if you can reproduce with a vanilla kernel.org kernel + Nick's patch, then it is something we should look at, otherwise it is something compat-wireless needs to address.
Comment 20 Joshua Covington 2011-05-23 16:50:41 UTC
OK, as I said in comment #17, everything is working again with Nick's patch, and therefore I propose to close this bug. Can you please report here when the patch has been submitted and accepted into the vanilla kernel?
Comment 21 Florian Mickler 2011-05-23 17:11:08 UTC
Comment 22 John W. Linville 2011-05-31 17:15:18 UTC
Nick, are you planning to submit this patch? Or did I miss it?
Comment 23 Nick Kossifidis 2011-05-31 17:19:08 UTC
Sorry for the delay :-( I've been very busy lately, I'll post it tomorrow...
Comment 24 Fabio Pedretti 2011-06-04 22:21:56 UTC
Was the patch pushed somewhere?
Comment 25 Joshua Covington 2011-06-05 01:58:49 UTC
yes, finally:

author Nick Kossifidis <firstname.lastname@example.org> Thu, 2 Jun 2011 00:09:48 +0000 (03:09 +0300)
committer John W. Linville <email@example.com> Fri, 3 Jun 2011 18:19:49 +0000 (14:19 -0400)
commit a99168eece601d2a79ecfcb968ce226f2f30cf98
tree 01598dfa43a08038f9b33cdae902f71156647471
parent bdf492f502ad4f646e9905db1b89e11822826edd

It was pushed to wireless-2.6.git and wireless-testing.git but it missed the 22.214.171.124 (too late). I hope it lands in 126.96.36.199 and newer because all of the latest versions (incl 2.6.39 and 3.0) have this problem.
Comment 26 John W. Linville 2011-06-06 15:01:05 UTC
I neglected to add 'Cc: firstname.lastname@example.org' to the commit message, but I sent something to that address today requesting inclusion in 2.6.38.y.
Comment 27 Fabio Pedretti 2011-06-15 13:27:52 UTC
Created attachment 62132 [details] dmesg showing the network loss at about 266s I am still having this problem. Maybe it's a different bug, but I'll report it here anyway; let me know if I have to open a new one. The problem is that I am able to connect to the wireless network (using NetworkManager from Ubuntu 11.04), but after about 210s I lose the connection and am no longer able to reconnect. I have to wait half an hour, then I can reconnect for another ~200s, and so on. I am using 188.8.131.52 with the patch from this bug + this patch http://patches.aircrack-ng.org/ath5k_regdomain_override.patch (to use 30dB power, but it makes no difference) + this https://patchwork.kernel.org/patch/103589/ (to avoid the -1 error with airmon). I enabled the following ath5k options:

CONFIG_ATH5K=m
CONFIG_ATH5K_DEBUG=y
CONFIG_ATH5K_TRACER=y
CONFIG_ATH5K_PCI=y

I have to say I connect to a public AP with WPA which is far from my home (I get about 70~90dB), but it always worked fine until some time ago. I am attaching my dmesg; note the gap between 53 and 266, during which the network worked. After that it fails and tries to reconnect without success.
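A quiet stretch like the one described can be located mechanically in a long dmesg. The following Python sketch assumes the usual `[seconds.micros]` timestamp prefix on each line; `largest_gap` is a hypothetical helper and the sample lines are invented, not copied from the attachment.

```python
import re

# Matches the leading "[  266.400000]" style timestamp of a dmesg line.
_TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def largest_gap(lines):
    """Return (start, end) timestamps of the longest silent stretch, or None."""
    stamps = []
    for line in lines:
        m = _TIMESTAMP.match(line)
        if m:
            stamps.append(float(m.group(1)))
    best = None
    for earlier, later in zip(stamps, stamps[1:]):
        if best is None or later - earlier > best[1] - best[0]:
            best = (earlier, later)
    return best


if __name__ == "__main__":
    # Invented sample lines for illustration only.
    sample = [
        "[   52.900000] wlan0: associated",
        "[   53.100000] some driver message",
        "[  266.400000] wlan0: connection lost",
        "[  266.900000] wlan0: trying to reconnect",
    ]
    print(largest_gap(sample))
```

Lines without a timestamp are ignored, so the helper can be fed the raw output of `dmesg` split into lines.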
Comment 28 Florian Mickler 2011-06-15 15:55:22 UTC
Yes, please file a new bug. This bug has been solved by the patch you have already applied, so you are not experiencing _this_ bug. Please file a new bug with an appropriate description. If you can determine a kernel release that does not exhibit your problem, then mark it as a regression and make it block the corresponding tracker bug (see bug #15790).

Regards, Flo

p.s.: for completeness' sake you can post that bug number here, so that any interested party can follow...
Comment 30 Joshua Covington 2011-06-26 18:51:46 UTC
(In reply to comment #26)
> I neglected to add 'Cc: email@example.com' to the commit message, but I sent
> something to that address today requesting inclusion in 2.6.38.y.

Sadly, this was left out of the current 184.108.40.206 and I bet it won't be in 220.127.116.11 either.
Comment 31 Joshua Covington 2011-07-09 07:40:11 UTC
Commit bdc5ce7ef6b7a4aa7a9ae7c60767783e6c5e438a in 18.104.22.168 and commit a99168eece601d2a79ecfcb968ce226f2f30cf98 upstream. Thanks.