Kernel Bug Tracker – Bug 34992
Regression with ath5k, cannot find any wireless network
Last modified: 2011-07-09 07:40:11 UTC
Created attachment 57542 [details]
dmesg with DEBUG=Y
My ath5k-based card was working perfectly until I upgraded to the kernel-18.104.22.168-24.fc15.x86_64. Now it cannot see any wireless networks.
`iwlist wlan0 scan` returns "no wireless found"
I tested this with the latest compat-wireless-2011-05-11 and DEBUG=Y without luck. The log/debug files are attached.
08:04.0 Ethernet controller: Atheros Communications Inc. AR2413 802.11bg NIC (rev 01)
Subsystem: AMBIT Microsystem Corp. Device 0418
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 168 (2500ns min, 7000ns max), Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 21
Region 0: Memory at c0200000 (32-bit, non-prefetchable) [size=64K]
Capabilities:  Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=2 PME-
Kernel driver in use: ath5k
Kernel modules: ath5k
Latest working kernel:
First non-working kernel:
Created attachment 57552 [details]
/var/log/messages with DEBUG=Y
I filtered the file for NetworkManager.
Created attachment 57562 [details]
/sys/kernel/debug/ieee80211 directory with debug=0xffffffff
This is the /sys/kernel/debug/ieee80211 directory. I compiled the ath5k driver from compat-wireless-2011-05-11 with DEBUG=Y and this is the corresponding directory.
Can you try out this patch?
Created attachment 57872 [details]
kernel-panic after nm-applet starts
The patch seems to work, because I can scan for networks.
As you can see from the attached photo, I get a kernel panic the moment nm-applet starts and tries to pass the password for a secured network. I'm not sure whether this is now a problem in the driver itself or something connected to NetworkManager.
However, the NetworkManager service itself seems to be fine; as I said, the kernel locks up when nm-applet tries to pass a password. What do you think: is this a bug in the driver or in nm-applet (NetworkManager)?
Was the algorithm for processing 63-ASCII passwords changed between 2.6.37 and 2.6.38?
This is the only logical explanation for me because a single program with user rights cannot cause a kernel-panic, can it?
Created attachment 57912 [details]
another kernel panic
I updated to the latest wpa_supplicant-0.7.3-8.fc15 and now I have another kernel panic. I'm quite puzzled at the moment as to whether this is a driver bug or not.
In both cases the kernel panics when the nm-applet tries to make the connection.
Created attachment 58662 [details]
another photo of the kernel-panic (seems to show more information than the previous photos)
This is another photo of the kernel panic that shows a little more information than the other ones.
Does anyone have any idea what's causing this lockup?
Well, I'm not sure what to make of the photos. rt_cache_flush (this is networking code nominally outside of the driver) seems to be implicated in 2 of them, while the other looks like it's just randomly in default_idle (but I can't tell what the actual error is in the latter).
Can you turn on slab/slub debugging and lockdep if they aren't already on?
I looked into the config file and found the following:
# CONFIG_SLAB is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
So some of them have been already set. Should I still set something (like debug level etc...), and if I have to recompile the kernel, are there any other options that need to be set? I don't like recompiling the kernel because it takes approx. 55 minutes, and I'd like to set everything that could help (and save some recompiling).
Are there any shortcuts for this kernel portion instead of recompiling everything? Maybe just the wireless stuff?
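For anyone repeating the check requested above, here is a minimal Python sketch that reports the slab/SLUB and lockdep settings found in a kernel config file. It assumes the config text is available (on Fedora it is normally installed as /boot/config-<kernel release>). The function names and the list of options are illustrative choices, not something taken from this report.

```python
import re

# Options worth checking for the request above. This list is an assumption
# made for illustration; extend it as needed.
DEBUG_OPTIONS = [
    "CONFIG_SLAB",
    "CONFIG_SLUB",
    "CONFIG_SLUB_DEBUG",
    "CONFIG_SLUB_DEBUG_ON",
    "CONFIG_LOCKDEP",
    "CONFIG_PROVE_LOCKING",
]

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_kernel_config(text):
    """Map each option to its value, or to None when marked 'is not set'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = None
    return options


def summarize(text, wanted=DEBUG_OPTIONS):
    """Return {option: value}; 'absent' means the option is not mentioned at all."""
    options = parse_kernel_config(text)
    return {name: options.get(name, "absent") for name in wanted}
```

To use it on a running system, pass in the contents of the config file, for example `summarize(open("/boot/config-" + platform.release()).read())` after `import platform`. A value of `None` means the option is explicitly disabled, while `"absent"` means the config does not mention it.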
(In reply to comment #10)
> So some of them have been already set. Should I still set something (like debug
> level etc...)
Yeah - please boot with kernel param "slub_debug=FPZ"
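A small Python sketch for confirming that the parameter took effect after rebooting: it extracts the slub_debug value from a kernel command line, such as the contents of /proc/cmdline. The flag meanings follow the kernel's SLUB documentation; the function names are hypothetical.

```python
# Meaning of the slub_debug flag letters, per the kernel's SLUB documentation.
SLUB_FLAGS = {
    "F": "sanity checks",
    "Z": "red zoning",
    "P": "poisoning",
    "U": "user tracking",
    "T": "trace",
}


def slub_debug_flags(cmdline):
    """Return the slub_debug flag string, '' for a bare 'slub_debug', or None if absent."""
    for token in cmdline.split():
        if token == "slub_debug":
            return ""  # bare parameter: full debugging for all caches
        if token.startswith("slub_debug="):
            value = token[len("slub_debug="):]
            # An optional ",<cache names>" suffix restricts debugging to some caches.
            return value.split(",", 1)[0]
    return None


def describe(flags):
    """Translate flag letters into short descriptions, in the order given."""
    return [SLUB_FLAGS.get(ch, "unknown flag " + ch) for ch in flags]
```

On a live system the command line can be read with `open("/proc/cmdline").read()`. For the parameter requested here, `describe(slub_debug_flags(...))` should list sanity checks, poisoning and red zoning.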
Created attachment 58772 [details]
dmesg with slub_debug=FPZ
There's nothing unusual here.
Created attachment 58792 [details]
screen shot of the kernel oops with slub_debug=FPZ
I hope this one can help.
Ok that does help, yes...
So it looks like we were in the middle of adding the debugfs file for the STA, then you got a timer interrupt, it ran softirqs, which caused rcu_process_callbacks to run... somewhere in there IP got set to 0x10 so there is maybe a dangling pointer, or a race with the debugfs code. Not sure yet what might be in the callback.
Did you apply Nick's patch to compat-wireless or did you rebuild the whole kernel? There seem to be some issues with the backports of some rcu code in compat-wireless where the structures have different layouts. If that's the case, can you try testing a vanilla kernel.org kernel + Nick's patch?
All the output is from kernel-22.214.171.124-24.fc15.x86_64 + compat-wireless-2011-05-11 + Nick's patch. It's a lot easier to use this than recompiling everything.
I'll try to use some of the stable releases available here: http://wireless.kernel.org/en/users/Download/stable/ because it takes such a long time to rebuild the whole kernel in my case.
I'll let you know what works. Thanks for the fast response!
Created attachment 58812 [details]
This is the output from 126.96.36.199-24.fc15.x86_64 + compat-wireless-2.6.39-rc6-1-sp.tar.bz2 + Nick's patch. ATH5K_DEBUG=Y
Everything works again.
It looks like the issues with the backports of some rcu code in compat-wireless that you were talking about have either been fixed in 2.6.39 or have not yet been applied to the stable branch.
So what should we do:
1. research further if those patches are to be added to the stable release (because they'll cause the lock again) or
2. close this bug IF those backports have already been fixed?
Created attachment 58852 [details]
2.6.39 + compat-wireless-2011-05-16 + Nick's patch
I decided to test this further and used the kernel-2.6.39-0.fc16.x86_64 + compat-wireless-2011-05-16 + Nick's patch.
Obviously these rcu changes still exist in the upstream compat-wireless. Now the backtrace shows some signs of the faulty rcu structure. I hope this is fixed soon, or at least before it lands in the mainstream kernel.
I don't think it's a mainline problem since structure layouts and APIs by definition are the same for any single version of the kernel. That is, if you can reproduce with a vanilla kernel.org kernel + Nick's patch, then it is something we should look at, otherwise it is something compat-wireless needs to address.
As I said in comment #17 everything is working again now with Nick's patch and therefore I propose to close this bug.
Can you please report when this is submitted and accepted in the vanilla kernel?
Nick, are you planning to submit this patch? Or did I miss it?
Sorry for the delay :-( I've been very busy lately, I'll post it tomorrow...
Was the patch pushed somewhere?
author Nick Kossifidis <firstname.lastname@example.org>
Thu, 2 Jun 2011 00:09:48 +0000 (03:09 +0300)
committer John W. Linville <email@example.com>
Fri, 3 Jun 2011 18:19:49 +0000 (14:19 -0400)
It was pushed to wireless-2.6.git and wireless-testing.git but it missed the 188.8.131.52 (too late). I hope it lands in 184.108.40.206 and newer because all of the latest versions (incl 2.6.39 and 3.0) have this problem.
I neglected to add 'Cc: firstname.lastname@example.org' to the commit message, but I sent something to that address today requesting inclusion in 2.6.38.y.
Created attachment 62132 [details]
dmesg showing the network loss at about 266s
I am still having this problem. Maybe it's a different bug, but I'll report it here anyway; let me know if I have to open a new one.
The problem is that I am able to connect to the wireless network (using NetworkManager from Ubuntu 11.04), but after about 210s I lose the connection and am no longer able to reconnect. I have to wait half an hour, then I can reconnect for another ~200s, and so on.
I am using 220.127.116.11 with the patch from this bug + this patch http://patches.aircrack-ng.org/ath5k_regdomain_override.patch (to use 30dB power but it makes no difference) + this https://patchwork.kernel.org/patch/103589/ (to avoid the -1 error with airmon). I enabled the following ath5k options:
I have to say I connect to a public AP with WPA which is far from my home (I get about 70~90dB), but it always worked fine until some time ago. I am attaching my dmesg; note the gap between 53-266 where the network worked, after which it fails and tries to reconnect without success.
Yes, please file a new bug. This bug has been solved by the patch you have already applied. So you do not experience _this_ bug.
Please file a new bug with an appropriate description. If you can determine a kernel release that does not exhibit your problem, then mark it as a regression and make it block the corresponding tracker bug (see bug #15790).
p.s.: for completeness' sake you can post that bug number here, so that any interested party can follow...
Thanks, filed as bug #37612.
(In reply to comment #26)
> I neglected to add 'Cc: email@example.com' to the commit message, but I sent
> something to that address today requesting inclusion in 2.6.38.y.
Sadly, this was left out of the current 18.104.22.168 and I bet it won't be in 22.214.171.124 either.
Commit bdc5ce7ef6b7a4aa7a9ae7c60767783e6c5e438a in 126.96.36.199 and commit a99168eece601d2a79ecfcb968ce226f2f30cf98 upstream. Thanks.